Is it possible to predict a country's happiness score from the data in both datasets? (the "Country name" must not be used in this model)
Yes! A regression model can be fitted to estimate any numeric attribute in the dataset, and a classification model to estimate any categorical one. The happiness score (Ladder score) is numeric, so this is a regression problem.
What could prevent the use of an estimation method is poor data quality. To show that the prediction is feasible, or to point out why it is not, the notebook analyzes the quality of the data and then cleans and standardizes it before the models are applied.
The models that support this answer are developed in the topic Q1 - Estimating Ladder score from the attributes
Is it possible to identify the region of the world a country belongs to from the relationship between the metrics and the happiness score obtained? Explain and justify your answer;
Yes! Since the region of the world is a categorical attribute, a classification model can be applied to determine it.
Once again, the obstacle would be poor quality or absence of data.
The models that support this answer are developed in the topic Q2 - Estimating Region from the attributes
Were the 2020 and/or 2021 data affected by the pandemic?
Yes!
During data handling, one plot was generated with the missing data still present (Q3.a) and another after imputing missing values and discarding the singular cases (Q3.b).
In both plots the indicators change behavior: almost all of them rise or fall sharply, or break a trend that had held for years.
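One way to put a number on "a sharp rise or fall" is to compare each year's mean with the previous year's. The snippet below is a minimal sketch of that idea, not the analysis used in the notebook; `yearly_change` is a hypothetical helper and the toy values are illustrative only, not taken from the datasets.

```python
import pandas as pd

def yearly_change(frame, year_col="year"):
    """Mean of every numeric indicator per year, then the change from the previous year."""
    yearly = frame.groupby(year_col).mean(numeric_only=True).sort_index()
    return yearly.diff()

# Illustrative values only (not real survey data)
toy = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2020, 2020],
    "Generosity": [0.00, 0.02, 0.01, 0.03, -0.04, -0.06],
})
changes = yearly_change(toy)
print(changes)
```

A change in 2020 or 2021 that is much larger than the changes of earlier years, or that has the opposite sign, would be consistent with the break in trend described above.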
import pandas as pd
import os
import numpy as np
import joblib
from pandas_profiling import ProfileReport as pr
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import ExtraTreesRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, classification_report
from sklearn.metrics import recall_score, roc_auc_score, r2_score, accuracy_score
from sklearn.metrics import explained_variance_score, max_error, mean_squared_error
from catboost import CatBoostClassifier, CatBoostRegressor, Pool, metrics, cv
import warnings
warnings.filterwarnings('ignore')
Two additional datasets with regional information about the countries were added to the data provided for the problem:
The regions_un dataset was produced by the UN
path = os.getcwd()
input_dfs = dict()
# Walk the data folder and load every spreadsheet, keyed by the file name without its extension
for dirname, _, filenames in os.walk(os.path.join(path, "data")):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        input_dfs[filename.split('.')[0]] = pd.read_excel(os.path.join(dirname, filename))
C:\Users\gabri\OneDrive\Documentos\GitHub\-case-felicidade-do-mundo-e-o-desenvolvimento-humano\data\Data_2021.xls
C:\Users\gabri\OneDrive\Documentos\GitHub\-case-felicidade-do-mundo-e-o-desenvolvimento-humano\data\HistoricData.xls
C:\Users\gabri\OneDrive\Documentos\GitHub\-case-felicidade-do-mundo-e-o-desenvolvimento-humano\data\regions_antwiki.xls
C:\Users\gabri\OneDrive\Documentos\GitHub\-case-felicidade-do-mundo-e-o-desenvolvimento-humano\data\regions_un.xls
input_dfs.keys()
dict_keys(['Data_2021', 'HistoricData', 'regions_antwiki', 'regions_un'])
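The loading loop above keys each DataFrame by its file name. A minimal sketch of the same discovery step with `pathlib` follows; `collect_data_files` is a hypothetical helper, and the `.xls` suffix is assumed from the file names printed above.

```python
from pathlib import Path

def collect_data_files(data_dir, suffix=".xls"):
    """Map each file stem (name without extension) to its path, searching subfolders too."""
    return {p.stem: p for p in sorted(Path(data_dir).rglob(f"*{suffix}"))}

# Usage (assumes a local "data" folder like the one in the notebook):
# input_dfs = {name: pd.read_excel(p) for name, p in collect_data_files("data").items()}
```

Unlike `filename.split('.')[0]`, `Path.stem` only strips the last extension, so a name that contains extra dots is not truncated at the first one.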
[*input_dfs['Data_2021'].columns]
['Country name', 'Regional indicator', 'Ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
[*input_dfs['HistoricData'].columns]
['Country name', 'year', 'Ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect']
input_dfs['HistoricData']['Regional indicator'] = None
input_dfs['HistoricData'] = input_dfs['HistoricData'][['Country name','Regional indicator','year','Ladder score','Logged GDP per capita',
'Social support','Healthy life expectancy','Freedom to make life choices',
'Generosity','Perceptions of corruption','Positive affect','Negative affect']]
input_dfs['Data_2021']['Positive affect'] = None
input_dfs['Data_2021']['Negative affect'] = None
input_dfs['Data_2021']['year'] = 2021
input_dfs['Data_2021'] = input_dfs['Data_2021'][['Country name','Regional indicator','year','Ladder score','Logged GDP per capita','Social support',
'Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption',
'Positive affect','Negative affect']]
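The two cells above add the missing columns by hand and then reorder them. As a sketch of an alternative, `DataFrame.reindex(columns=...)` does both steps at once; `align_columns` is a hypothetical helper, the column list is shortened for the example, and the row values are placeholders.

```python
import pandas as pd

# Shortened, illustrative column list (the notebook uses twelve columns)
COLUMNS = ["Country name", "Regional indicator", "year", "Ladder score"]

def align_columns(frame, columns=COLUMNS):
    """Return the frame with exactly `columns`, in that order; absent ones are filled with NaN."""
    return frame.reindex(columns=columns)

# Placeholder row, not real data
historic = pd.DataFrame({"Country name": ["A"], "year": [2019], "Ladder score": [5.0]})
aligned = align_columns(historic)
print(aligned)
```

Note that `reindex` fills new columns with `NaN` rather than `None`, which is one difference from the assignment used in the notebook.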
# Checking the data
input_dfs['Data_2021'].head()
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Finland | Western Europe | 2021 | 7.8421 | 10.775202 | 0.953603 | 72.000000 | 0.949268 | -0.097760 | 0.185846 | None | None |
| 1 | Denmark | Western Europe | 2021 | 7.6195 | 10.933176 | 0.954410 | 72.699753 | 0.945639 | 0.030109 | 0.178838 | None | None |
| 2 | Switzerland | Western Europe | 2021 | 7.5715 | 11.117368 | 0.941742 | 74.400101 | 0.918788 | 0.024629 | 0.291698 | None | None |
| 3 | Iceland | Western Europe | 2021 | 7.5539 | 10.877768 | 0.982938 | 73.000000 | 0.955123 | 0.160274 | 0.672865 | None | None |
| 4 | Netherlands | Western Europe | 2021 | 7.4640 | 10.931812 | 0.941601 | 72.400116 | 0.913116 | 0.175404 | 0.337938 | None | None |
# Checking the data
input_dfs['HistoricData'].head()
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | None | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | 0.258195 |
| 1 | Afghanistan | None | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | 0.237092 |
| 2 | Afghanistan | None | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | 0.275324 |
| 3 | Afghanistan | None | 2011 | 3.831719 | 7.619532 | 0.521104 | 51.919998 | 0.495901 | 0.162427 | 0.731109 | 0.611387 | 0.267175 |
| 4 | Afghanistan | None | 2012 | 3.782938 | 7.705479 | 0.520637 | 52.240002 | 0.530935 | 0.236032 | 0.775620 | 0.710385 | 0.267919 |
input_dfs['Data_2021'].shape
(149, 12)
input_dfs['HistoricData'].shape
(1949, 12)
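# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df = pd.concat([input_dfs['Data_2021'], input_dfs['HistoricData']])
df.sort_values(by=['Country name','year'], ascending=True, inplace=True)
df.index = range(len(df.index))
# the final shape is the sum of the two separate shapes, so the concatenation is OK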
df.shape
(2098, 12)
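The shape check above confirms that no rows were lost when the two tables were stacked. A minimal sketch of a variant that also records which table each row came from is shown below; `stack_with_source` and the `source` column are hypothetical names, not part of the notebook.

```python
import pandas as pd

def stack_with_source(frames):
    """Concatenate a dict of DataFrames and tag every row with the key it came from."""
    tagged = [frame.assign(source=name) for name, frame in frames.items()]
    combined = pd.concat(tagged, ignore_index=True)
    # The stacked table must have as many rows as its parts together
    assert len(combined) == sum(len(frame) for frame in frames.values())
    return combined

# Placeholder frames, not real data
a = pd.DataFrame({"year": [2021, 2021]})
b = pd.DataFrame({"year": [2019, 2020, 2020]})
combined = stack_with_source({"Data_2021": a, "HistoricData": b})
print(combined["source"].value_counts())
```

With a provenance column, checks such as "every 2021 row comes from `Data_2021`" can be written as a direct comparison instead of a comparison of shapes.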
# Remove duplicate rows, if any
df.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=False)
df.shape
(2098, 12)
df.head()
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | None | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | 0.258195 |
| 1 | Afghanistan | None | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | 0.237092 |
| 2 | Afghanistan | None | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | 0.275324 |
| 3 | Afghanistan | None | 2011 | 3.831719 | 7.619532 | 0.521104 | 51.919998 | 0.495901 | 0.162427 | 0.731109 | 0.611387 | 0.267175 |
| 4 | Afghanistan | None | 2012 | 3.782938 | 7.705479 | 0.520637 | 52.240002 | 0.530935 | 0.236032 | 0.775620 | 0.710385 | 0.267919 |
# every row from 2021 comes from 'Data_2021'
df[df.year == 2021].shape == input_dfs['Data_2021'].shape
True
# every row from before 2021 comes from 'HistoricData'
df[df.year != 2021].shape == input_dfs['HistoricData'].shape
True
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Country name                  2098 non-null   object
 1   Regional indicator            149 non-null    object
 2   year                          2098 non-null   int64
 3   Ladder score                  2098 non-null   float64
 4   Logged GDP per capita         2062 non-null   float64
 5   Social support                2085 non-null   float64
 6   Healthy life expectancy       2043 non-null   float64
 7   Freedom to make life choices  2066 non-null   float64
 8   Generosity                    2009 non-null   float64
 9   Perceptions of corruption     1988 non-null   float64
 10  Positive affect               1927 non-null   object
 11  Negative affect               1933 non-null   object
dtypes: float64(7), int64(1), object(4)
memory usage: 213.1+ KB
df['Positive affect'] = df['Positive affect'].astype(float)
df['Negative affect'] = df['Negative affect'].astype(float)
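The two affect columns show up as `object` in `df.info()` because they mix `None` with floats, which is why they are cast above. A small sketch of the same conversion with `pd.to_numeric` follows; the values are placeholders.

```python
import pandas as pd

# An object column that mixes None with floats (placeholder values)
mixed = pd.Series([None, 0.71, 0.58], dtype=object)

# None becomes NaN and the dtype becomes float64
as_float = pd.to_numeric(mixed)
print(as_float.dtype)
```

`pd.to_numeric` also accepts `errors="coerce"`, which turns unparseable entries into `NaN` instead of raising; whether that is desirable depends on how much silent data loss is acceptable.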
df.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Country name | 2098 | 166 | Zimbabwe | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Regional indicator | 149 | 10 | Sub-Saharan Africa | 36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| year | 2098.0 | NaN | NaN | NaN | 2013.768827 | 4.486449 | 2005.0 | 2010.0 | 2014.0 | 2018.0 | 2021.0 |
| Ladder score | 2098.0 | NaN | NaN | NaN | 5.471403 | 1.112682 | 2.375092 | 4.652504 | 5.391887 | 6.282982 | 8.018934 |
| Logged GDP per capita | 2062.0 | NaN | NaN | NaN | 9.373065 | 1.154252 | 6.635322 | 8.470213 | 9.462173 | 10.360714 | 11.648169 |
| Social support | 2085.0 | NaN | NaN | NaN | 0.812709 | 0.118202 | 0.290184 | 0.749633 | 0.834716 | 0.90529 | 0.987343 |
| Healthy life expectancy | 2043.0 | NaN | NaN | NaN | 63.478503 | 7.468781 | 32.299999 | 58.7045 | 65.279999 | 68.660004 | 77.099998 |
| Freedom to make life choices | 2066.0 | NaN | NaN | NaN | 0.746101 | 0.140774 | 0.257534 | 0.652307 | 0.766931 | 0.859147 | 0.985178 |
| Generosity | 2009.0 | NaN | NaN | NaN | -0.001023 | 0.161405 | -0.33504 | -0.115171 | -0.026638 | 0.089205 | 0.698099 |
| Perceptions of corruption | 1988.0 | NaN | NaN | NaN | 0.745639 | 0.186267 | 0.035198 | 0.688764 | 0.800729 | 0.869042 | 0.983276 |
| Positive affect | 1927.0 | NaN | NaN | NaN | 0.709998 | 0.107106 | 0.32169 | 0.625373 | 0.722391 | 0.799276 | 0.943621 |
| Negative affect | 1933.0 | NaN | NaN | NaN | 0.268552 | 0.085176 | 0.082737 | 0.206403 | 0.258117 | 0.319716 | 0.70459 |
for c in df:
    print(f"[{c}]:\nTotal nulls: {df[c].isna().sum()}\tNull percentage: {round(100*df[c].isna().sum()/df.shape[0],3)}%\n")
[Country name]:
Total nulls: 0    Null percentage: 0.0%

[Regional indicator]:
Total nulls: 1949    Null percentage: 92.898%

[year]:
Total nulls: 0    Null percentage: 0.0%

[Ladder score]:
Total nulls: 0    Null percentage: 0.0%

[Logged GDP per capita]:
Total nulls: 36    Null percentage: 1.716%

[Social support]:
Total nulls: 13    Null percentage: 0.62%

[Healthy life expectancy]:
Total nulls: 55    Null percentage: 2.622%

[Freedom to make life choices]:
Total nulls: 32    Null percentage: 1.525%

[Generosity]:
Total nulls: 89    Null percentage: 4.242%

[Perceptions of corruption]:
Total nulls: 110    Null percentage: 5.243%

[Positive affect]:
Total nulls: 171    Null percentage: 8.151%

[Negative affect]:
Total nulls: 165    Null percentage: 7.865%
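The per-column loop above can also be expressed as a single table. This is a minimal sketch with a hypothetical helper `null_summary`, shown on placeholder data.

```python
import pandas as pd

def null_summary(frame):
    """Count of nulls and percentage of nulls for every column, as one table."""
    nulls = frame.isna().sum()
    percent = (100 * nulls / len(frame)).round(3)
    return pd.DataFrame({"nulls": nulls, "percent": percent})

# Placeholder data
toy = pd.DataFrame({"a": [1.0, None, 3.0, None], "b": ["x", "y", None, "z"]})
summary = null_summary(toy)
print(summary)
```

Keeping the result as a DataFrame makes it easy to sort by `percent` or to filter the columns that exceed a chosen threshold.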
#na_ano = df.groupby(by=['Country name','year'],as_index=False).agg('mean')
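# numeric_only=True keeps the text columns out of the mean (recent pandas versions raise otherwise)
pais_ano = df.groupby(by=['Country name','year']).mean(numeric_only=True)
ao_ano_na = df.groupby(by='year',as_index=False).mean(numeric_only=True)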
na_scaler = MinMaxScaler()
sc_ano_na = pd.DataFrame(na_scaler.fit_transform(ao_ano_na.drop(columns='year')),columns=ao_ano_na.drop(columns='year').columns)
sc_ano_na['year'] = ao_ano_na['year']
sc_ano_na = sc_ano_na[[*ao_ano_na.columns]]
print(f"About the scaler fitted on the data with missing values:\nFeatures: {na_scaler.feature_names_in_}\nMaximum values: {na_scaler.data_max_}\nMinimum values: {na_scaler.data_min_}\nFeature range: {na_scaler.feature_range}\nGeneral parameters: {na_scaler.get_params()}")
About the scaler fitted on the data with missing values:
Features: ['Ladder score' 'Logged GDP per capita' 'Social support'
 'Healthy life expectancy' 'Freedom to make life choices' 'Generosity'
 'Perceptions of corruption' 'Positive affect' 'Negative affect']
Maximum values: [ 6.44616427 10.11863786 0.89736686 67.0995636 0.82961793 0.25623003
 0.79206881 0.74856599 0.29270757]
Minimum values: [ 5.19693529e+00 9.04429773e+00 7.84400843e-01 6.01475000e+01
 6.87328704e-01 -2.31054730e-02 7.07697230e-01 7.01571426e-01
 2.40694924e-01]
Feature range: (0, 1)
General parameters: {'clip': False, 'copy': True, 'feature_range': (0, 1)}
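With the default `feature_range` of (0, 1), min-max scaling maps each column to `(x - min) / (max - min)`. The sketch below reproduces that formula in plain pandas to make the transformation explicit; `min_max_scale` is a hypothetical helper and the values are placeholders.

```python
import pandas as pd

def min_max_scale(frame):
    """Rescale every column to the [0, 1] range using its own minimum and maximum."""
    lo, hi = frame.min(), frame.max()
    return (frame - lo) / (hi - lo)

# Placeholder values
toy = pd.DataFrame({"Ladder score": [5.0, 5.5, 6.0], "Generosity": [-0.02, 0.12, 0.26]})
scaled = min_max_scale(toy)
print(scaled)
```

Both pandas and `MinMaxScaler` skip `NaN` when computing the minimum and maximum. A constant column would divide by zero in this sketch, a case the scikit-learn scaler handles on its own.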
fig = go.Figure()
cols = [*sc_ano_na.columns]
cols.remove('year')
for column in cols:
    fig.add_trace(go.Scatter( x = sc_ano_na.year, y = sc_ano_na[column], name = column, mode = 'lines') )
fig.update_layout(title = "Global Scaled Indicators per Year [with missing data]", xaxis_title = 'Year')
fig.show()
fig.write_html('indicadores_global_norm_faltantes.html')
br_ano_na = df[df['Country name']=='Brazil'].groupby(by='year',as_index=False).mean(numeric_only=True)
cols = [*br_ano_na.columns]
cols.remove('year')
fig = go.Figure()
for column in cols:
    fig.add_trace(go.Scatter( x = br_ano_na.year, y = br_ano_na[column], name = column, mode = 'lines') )
fig.update_layout(title = "Indicators per Year [Brazil, not normalized]", xaxis_title = 'Year')
fig.show()
fig.write_html('indicadores_br_raw_faltantes.html')
casos_singulares = []
uma_amostra = []
for p in df['Country name'].unique():
    df_pais = df[df['Country name']==p]
    # Flag countries where some indicator is missing in all samples, or in all but one
    if (df_pais.drop(columns='Regional indicator').isna().sum() >= df_pais.shape[0]-1).any():
        print(f"Country: {p}\nSamples: {df_pais.shape[0]}\tTotal missing: {df_pais.isna().sum().sum()}")
        display(pd.DataFrame(df_pais.drop(columns='Regional indicator').isna().sum()).T)
        casos_singulares.append(p)
    # Countries with a single sample always meet the condition above, so track them separately
    if df_pais.shape[0] == 1:
        uma_amostra.append(p)
Country: China
Samples: 16    Total missing: 38
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 15 | 1 | 1 |
Country: Cuba
Samples: 1    Total missing: 4
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Country: Guyana
Samples: 1    Total missing: 1
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Country: Hong Kong S.A.R. of China
Samples: 12    Total missing: 26
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 11 | 0 | 1 | 0 | 1 | 1 |
Country: Kosovo
Samples: 15    Total missing: 34
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 14 | 1 | 1 | 0 | 2 | 1 |
Country: Maldives
Samples: 2    Total missing: 6
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 2 |
Country: North Cyprus
Samples: 8    Total missing: 30
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 7 | 0 | 7 | 0 | 7 | 0 | 1 | 1 |
Country: Oman
Samples: 1    Total missing: 4
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
Country: Qatar
Samples: 5    Total missing: 18
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 4 | 2 | 2 |
Country: Somalia
Samples: 3    Total missing: 9
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
Country: Somaliland region
Samples: 4    Total missing: 16
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 4 | 0 | 4 | 0 | 4 | 0 | 0 | 0 |
Country: South Sudan
Samples: 4    Total missing: 12
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
Country: Suriname
Samples: 1    Total missing: 1
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Country: Turkmenistan
Samples: 11    Total missing: 24
| | Country name | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 10 | 1 | 1 |
casos_singulares = list(set(casos_singulares)-set(uma_amostra))
print(f"Singular cases of missing data: {len(casos_singulares)}\n{[*casos_singulares]}")
Singular cases of missing data: 10
['Hong Kong S.A.R. of China', 'China', 'Turkmenistan', 'South Sudan', 'Somaliland region', 'Qatar', 'Kosovo', 'North Cyprus', 'Maldives', 'Somalia']
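The detection rule used above (some indicator missing in all samples of a country, or in all but one) can also be computed without an explicit loop. The sketch below is one possible vectorized version, not the notebook's code; `flag_sparse_countries` is a hypothetical helper and the toy frame is placeholder data.

```python
import pandas as pd

def flag_sparse_countries(frame, key="Country name", ignore=("Regional indicator",)):
    """Return (countries with a mostly-missing column, countries with a single sample)."""
    feats = frame.drop(columns=list(ignore), errors="ignore")
    # Missing values per country and column
    missing = feats.drop(columns=key).isna().groupby(feats[key]).sum()
    sizes = feats.groupby(key).size()
    # A column is "mostly missing" when at most one sample of the country has a value
    mostly_missing = missing.ge(sizes - 1, axis=0).any(axis=1)
    single = sizes.eq(1)
    sparse = sorted(mostly_missing[mostly_missing & ~single].index)
    return sparse, sorted(single[single].index)

# Placeholder data: A lacks x in 2 of 3 samples, B is complete, C has one sample
toy = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B", "B", "C"],
    "Regional indicator": [None] * 7,
    "x": [1.0, None, None, 1.0, 2.0, 3.0, 1.0],
    "y": [1.0] * 7,
})
sparse, single = flag_sparse_countries(toy)
print(sparse, single)
```

Single-sample countries are reported separately because, with one row, the threshold `sizes - 1` is zero and every column satisfies the rule trivially.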
df[df['Country name'].isin(uma_amostra)]
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 454 | Cuba | None | 2006 | 5.417869 | NaN | 0.969595 | 68.440002 | 0.281458 | NaN | NaN | 0.646712 | 0.276602 |
| 723 | Guyana | None | 2007 | 5.992826 | 8.773289 | 0.848765 | 57.259998 | 0.694006 | 0.110037 | 0.835569 | 0.767541 | 0.296420 |
| 1414 | Oman | None | 2011 | 6.852982 | 10.382462 | NaN | 65.500000 | 0.916293 | 0.024908 | NaN | NaN | 0.295164 |
| 1759 | Suriname | None | 2012 | 6.269287 | 9.797085 | 0.797262 | 62.240002 | 0.885488 | -0.077173 | 0.751283 | 0.764223 | 0.250365 |
df[df['Country name']=='Kosovo']
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 973 | Kosovo | None | 2007 | 5.103906 | 8.927753 | 0.847812 | NaN | 0.381364 | 0.143901 | 0.894462 | 0.654866 | 0.236699 |
| 974 | Kosovo | None | 2008 | 5.521660 | 8.980872 | 0.883843 | NaN | NaN | 0.090464 | 0.849059 | NaN | 0.317828 |
| 975 | Kosovo | None | 2009 | 5.891433 | 9.008162 | 0.830427 | NaN | 0.506415 | 0.200504 | 0.967839 | 0.597583 | 0.168830 |
| 976 | Kosovo | None | 2010 | 5.176601 | 9.032693 | 0.707959 | NaN | 0.451444 | 0.169696 | 0.967272 | 0.695178 | 0.117717 |
| 977 | Kosovo | None | 2011 | 4.859502 | 9.066925 | 0.759102 | NaN | 0.588979 | 0.003699 | 0.919212 | 0.695966 | 0.124438 |
| 978 | Kosovo | None | 2012 | 5.639588 | 9.085688 | 0.757147 | NaN | 0.635793 | 0.027182 | 0.949651 | 0.595572 | 0.099630 |
| 979 | Kosovo | None | 2013 | 6.125758 | 9.113430 | 0.720750 | NaN | 0.568463 | 0.114904 | 0.935095 | 0.691511 | 0.202731 |
| 980 | Kosovo | None | 2014 | 5.000375 | 9.128522 | 0.705632 | NaN | 0.441391 | 0.012095 | 0.775201 | 0.636128 | 0.205950 |
| 981 | Kosovo | None | 2015 | 5.077461 | 9.182307 | 0.805271 | NaN | 0.561048 | 0.180851 | 0.850647 | 0.753090 | 0.179989 |
| 982 | Kosovo | None | 2016 | 5.759412 | 9.228177 | 0.823803 | NaN | 0.827399 | 0.124869 | 0.940898 | 0.703887 | 0.149607 |
| 983 | Kosovo | None | 2017 | 6.149200 | 9.262030 | 0.792087 | NaN | 0.857677 | 0.117175 | 0.925192 | 0.738436 | 0.185879 |
| 984 | Kosovo | None | 2018 | 6.391826 | 9.296085 | 0.822407 | NaN | 0.889737 | 0.268795 | 0.922078 | 0.778271 | 0.170248 |
| 985 | Kosovo | None | 2019 | 6.425144 | 9.338535 | 0.842511 | NaN | 0.841190 | 0.246990 | 0.920297 | 0.748522 | 0.140792 |
| 986 | Kosovo | None | 2020 | 6.294414 | NaN | 0.792374 | NaN | 0.879838 | NaN | 0.909894 | 0.726240 | 0.201458 |
| 987 | Kosovo | Central and Eastern Europe | 2021 | 6.372000 | 9.318236 | 0.820958 | 63.812744 | 0.868972 | 0.257417 | 0.917488 | NaN | NaN |
df[df['Country name']=='China']
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 367 | China | None | 2006 | 4.560495 | 8.696120 | 0.747011 | 66.879997 | NaN | NaN | NaN | 0.809295 | 0.169580 |
| 368 | China | None | 2007 | 4.862862 | 8.823954 | 0.810852 | 67.059998 | NaN | -0.176243 | NaN | 0.817485 | 0.158614 |
| 369 | China | None | 2008 | 4.846295 | 8.910992 | 0.748287 | 67.239998 | 0.853072 | -0.092472 | NaN | 0.817443 | 0.146963 |
| 370 | China | None | 2009 | 4.454361 | 8.995857 | 0.798034 | 67.419998 | 0.771143 | -0.160481 | NaN | 0.785806 | 0.161650 |
| 371 | China | None | 2010 | 4.652737 | 9.092104 | 0.767753 | 67.599998 | 0.804794 | -0.133318 | NaN | 0.765265 | 0.158100 |
| 372 | China | None | 2011 | 5.037208 | 9.178532 | 0.787171 | 67.760002 | 0.824162 | -0.186383 | NaN | 0.820074 | 0.133503 |
| 373 | China | None | 2012 | 5.094917 | 9.249320 | 0.787818 | 67.919998 | 0.808255 | -0.184676 | NaN | 0.820785 | 0.158703 |
| 374 | China | None | 2013 | 5.241090 | 9.319200 | 0.777896 | 68.080002 | 0.804724 | -0.157777 | NaN | 0.836431 | 0.142211 |
| 375 | China | None | 2014 | 5.195619 | 9.385755 | 0.820366 | 68.239998 | NaN | -0.216772 | NaN | 0.853975 | 0.111518 |
| 376 | China | None | 2015 | 5.303878 | 9.448723 | 0.793734 | 68.400002 | NaN | -0.244435 | NaN | 0.808911 | 0.171315 |
| 377 | China | None | 2016 | 5.324956 | 9.509552 | 0.741703 | 68.699997 | NaN | -0.227522 | NaN | 0.826144 | 0.145625 |
| 378 | China | None | 2017 | 5.099061 | 9.571116 | 0.772033 | 69.000000 | 0.877618 | -0.174832 | NaN | 0.821097 | 0.214005 |
| 379 | China | None | 2018 | 5.131434 | 9.631892 | 0.787605 | 69.300003 | 0.895378 | -0.158510 | NaN | 0.855784 | 0.189640 |
| 380 | China | None | 2019 | 5.144120 | 9.687612 | 0.821936 | 69.599998 | 0.927356 | -0.173036 | NaN | 0.890780 | 0.146512 |
| 381 | China | None | 2020 | 5.771065 | 9.701755 | 0.808334 | 69.900002 | 0.891123 | -0.103214 | NaN | 0.789345 | 0.244918 |
| 382 | China | East Asia | 2021 | 5.339100 | 9.673172 | 0.810829 | 69.593407 | 0.904293 | -0.145908 | 0.755389 | NaN | NaN |
len(df['Regional indicator'].unique())
11
df['Regional indicator'].unique()
array([None, 'South Asia', 'Central and Eastern Europe',
'Middle East and North Africa', 'Latin America and Caribbean',
'Commonwealth of Independent States', 'North America and ANZ',
'Western Europe', 'Sub-Saharan Africa', 'Southeast Asia',
'East Asia'], dtype=object)
# countries per region
for ri in df['Regional indicator'].unique():
    print(f"{ri}: { len(df[df['Regional indicator'] == ri]['Country name'].unique())}, {df[df['Regional indicator'] == ri]['Country name'].unique()}\n")
None: 0, []

South Asia: 7, ['Afghanistan' 'Bangladesh' 'India' 'Maldives' 'Nepal' 'Pakistan' 'Sri Lanka']

Central and Eastern Europe: 17, ['Albania' 'Bosnia and Herzegovina' 'Bulgaria' 'Croatia' 'Czech Republic' 'Estonia' 'Hungary' 'Kosovo' 'Latvia' 'Lithuania' 'Montenegro' 'North Macedonia' 'Poland' 'Romania' 'Serbia' 'Slovakia' 'Slovenia']

Middle East and North Africa: 17, ['Algeria' 'Bahrain' 'Egypt' 'Iran' 'Iraq' 'Israel' 'Jordan' 'Kuwait' 'Lebanon' 'Libya' 'Morocco' 'Palestinian Territories' 'Saudi Arabia' 'Tunisia' 'Turkey' 'United Arab Emirates' 'Yemen']

Latin America and Caribbean: 20, ['Argentina' 'Bolivia' 'Brazil' 'Chile' 'Colombia' 'Costa Rica' 'Dominican Republic' 'Ecuador' 'El Salvador' 'Guatemala' 'Haiti' 'Honduras' 'Jamaica' 'Mexico' 'Nicaragua' 'Panama' 'Paraguay' 'Peru' 'Uruguay' 'Venezuela']

Commonwealth of Independent States: 12, ['Armenia' 'Azerbaijan' 'Belarus' 'Georgia' 'Kazakhstan' 'Kyrgyzstan' 'Moldova' 'Russia' 'Tajikistan' 'Turkmenistan' 'Ukraine' 'Uzbekistan']

North America and ANZ: 4, ['Australia' 'Canada' 'New Zealand' 'United States']

Western Europe: 21, ['Austria' 'Belgium' 'Cyprus' 'Denmark' 'Finland' 'France' 'Germany' 'Greece' 'Iceland' 'Ireland' 'Italy' 'Luxembourg' 'Malta' 'Netherlands' 'North Cyprus' 'Norway' 'Portugal' 'Spain' 'Sweden' 'Switzerland' 'United Kingdom']

Sub-Saharan Africa: 36, ['Benin' 'Botswana' 'Burkina Faso' 'Burundi' 'Cameroon' 'Chad' 'Comoros' 'Congo (Brazzaville)' 'Ethiopia' 'Gabon' 'Gambia' 'Ghana' 'Guinea' 'Ivory Coast' 'Kenya' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali' 'Mauritania' 'Mauritius' 'Mozambique' 'Namibia' 'Niger' 'Nigeria' 'Rwanda' 'Senegal' 'Sierra Leone' 'South Africa' 'Swaziland' 'Tanzania' 'Togo' 'Uganda' 'Zambia' 'Zimbabwe']

Southeast Asia: 9, ['Cambodia' 'Indonesia' 'Laos' 'Malaysia' 'Myanmar' 'Philippines' 'Singapore' 'Thailand' 'Vietnam']

East Asia: 6, ['China' 'Hong Kong S.A.R. of China' 'Japan' 'Mongolia' 'South Korea' 'Taiwan Province of China']
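The count of countries per region can also be obtained with a single `groupby`. This is a minimal sketch on placeholder data; `countries_per_region` is a hypothetical helper.

```python
import pandas as pd

def countries_per_region(frame, region="Regional indicator", country="Country name"):
    """Number of distinct countries in each region; rows without a region are ignored."""
    return frame.groupby(region)[country].nunique().sort_values(ascending=False)

# Placeholder data
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "C"],
    "Regional indicator": ["R1", None, "R1", "R2"],
})
counts = countries_per_region(toy)
print(counts)
```

`groupby` drops null keys by default, so the rows that still lack a region do not produce a group of their own.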
df[ (df['Country name'] == 'Afghanistan') ]
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | None | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | 0.258195 |
| 1 | Afghanistan | None | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | 0.237092 |
| 2 | Afghanistan | None | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | 0.275324 |
| 3 | Afghanistan | None | 2011 | 3.831719 | 7.619532 | 0.521104 | 51.919998 | 0.495901 | 0.162427 | 0.731109 | 0.611387 | 0.267175 |
| 4 | Afghanistan | None | 2012 | 3.782938 | 7.705479 | 0.520637 | 52.240002 | 0.530935 | 0.236032 | 0.775620 | 0.710385 | 0.267919 |
| 5 | Afghanistan | None | 2013 | 3.572100 | 7.725029 | 0.483552 | 52.560001 | 0.577955 | 0.061148 | 0.823204 | 0.620585 | 0.273328 |
| 6 | Afghanistan | None | 2014 | 3.130896 | 7.718354 | 0.525568 | 52.880001 | 0.508514 | 0.104013 | 0.871242 | 0.531691 | 0.374861 |
| 7 | Afghanistan | None | 2015 | 3.982855 | 7.701992 | 0.528597 | 53.200001 | 0.388928 | 0.079864 | 0.880638 | 0.553553 | 0.339276 |
| 8 | Afghanistan | None | 2016 | 4.220169 | 7.696560 | 0.559072 | 53.000000 | 0.522566 | 0.042265 | 0.793246 | 0.564953 | 0.348332 |
| 9 | Afghanistan | None | 2017 | 2.661718 | 7.697381 | 0.490880 | 52.799999 | 0.427011 | -0.121303 | 0.954393 | 0.496349 | 0.371326 |
| 10 | Afghanistan | None | 2018 | 2.694303 | 7.691767 | 0.507516 | 52.599998 | 0.373536 | -0.093828 | 0.927606 | 0.424125 | 0.404904 |
| 11 | Afghanistan | None | 2019 | 2.375092 | 7.697248 | 0.419973 | 52.400002 | 0.393656 | -0.108459 | 0.923849 | 0.351387 | 0.502474 |
| 12 | Afghanistan | South Asia | 2021 | 2.522900 | 7.694710 | 0.462596 | 52.492615 | 0.381749 | -0.101684 | 0.924338 | NaN | NaN |
df[ (df['Country name'] == 'Austria') ]
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 86 | Austria | None | 2006 | 7.122211 | 10.841940 | 0.936350 | 70.760002 | 0.941382 | 0.302386 | 0.490111 | 0.823105 | 0.173812 |
| 87 | Austria | None | 2008 | 7.180954 | 10.886662 | 0.934593 | 71.080002 | 0.879069 | 0.291309 | 0.613625 | 0.832170 | 0.173195 |
| 88 | Austria | None | 2010 | 7.302679 | 10.861471 | 0.914193 | 71.400002 | 0.895980 | 0.130891 | 0.546145 | 0.814719 | 0.155793 |
| 89 | Austria | None | 2011 | 7.470513 | 10.886909 | 0.944157 | 71.540001 | 0.939356 | 0.131578 | 0.702721 | 0.789471 | 0.145238 |
| 90 | Austria | None | 2012 | 7.400689 | 10.889132 | 0.945142 | 71.680000 | 0.919704 | 0.117804 | 0.770586 | 0.822248 | 0.156675 |
| 91 | Austria | None | 2013 | 7.498803 | 10.883492 | 0.949809 | 71.820000 | 0.921734 | 0.168248 | 0.678937 | 0.787313 | 0.162603 |
| 92 | Austria | None | 2014 | 6.950000 | 10.882268 | 0.898920 | 71.959999 | 0.885027 | 0.117607 | 0.566931 | 0.779693 | 0.170150 |
| 93 | Austria | None | 2015 | 7.076447 | 10.881152 | 0.928110 | 72.099998 | 0.900305 | 0.098893 | 0.557480 | 0.798263 | 0.164469 |
| 94 | Austria | None | 2016 | 7.048072 | 10.890950 | 0.926319 | 72.400002 | 0.888514 | 0.079749 | 0.523641 | 0.755903 | 0.197424 |
| 95 | Austria | None | 2017 | 7.293728 | 10.908466 | 0.906218 | 72.699997 | 0.890031 | 0.133064 | 0.518304 | 0.747569 | 0.180269 |
| 96 | Austria | None | 2018 | 7.396002 | 10.927505 | 0.911668 | 73.000000 | 0.904112 | 0.053470 | 0.523061 | 0.752350 | 0.226059 |
| 97 | Austria | None | 2019 | 7.195361 | 10.939381 | 0.964489 | 73.300003 | 0.903428 | 0.059686 | 0.457089 | 0.774459 | 0.205170 |
| 98 | Austria | None | 2020 | 7.213489 | 10.851118 | 0.924831 | 73.599998 | 0.911910 | 0.011032 | 0.463830 | 0.769317 | 0.206500 |
| 99 | Austria | Western Europe | 2021 | 7.267800 | 10.906316 | 0.934176 | 73.299721 | 0.907691 | 0.041568 | 0.481378 | NaN | NaN |
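# fill missing regions with the value already provided for the same country
for p in df['Country name'].unique():
    # keep only the non-null regions; the original regs.remove(None) raised
    # ValueError for countries that had no missing region
    regs = [ r for r in df[ df['Country name'] == p ]['Regional indicator'].unique() if pd.notna(r) ]
    if len(regs) == 1:
        df.loc[df[ df['Country name'] == p ].index,['Regional indicator']] = regs[0]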
df[ (df['Country name'] == 'Afghanistan') ]
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | 0.258195 |
| 1 | Afghanistan | South Asia | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | 0.237092 |
| 2 | Afghanistan | South Asia | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | 0.275324 |
| 3 | Afghanistan | South Asia | 2011 | 3.831719 | 7.619532 | 0.521104 | 51.919998 | 0.495901 | 0.162427 | 0.731109 | 0.611387 | 0.267175 |
| 4 | Afghanistan | South Asia | 2012 | 3.782938 | 7.705479 | 0.520637 | 52.240002 | 0.530935 | 0.236032 | 0.775620 | 0.710385 | 0.267919 |
| 5 | Afghanistan | South Asia | 2013 | 3.572100 | 7.725029 | 0.483552 | 52.560001 | 0.577955 | 0.061148 | 0.823204 | 0.620585 | 0.273328 |
| 6 | Afghanistan | South Asia | 2014 | 3.130896 | 7.718354 | 0.525568 | 52.880001 | 0.508514 | 0.104013 | 0.871242 | 0.531691 | 0.374861 |
| 7 | Afghanistan | South Asia | 2015 | 3.982855 | 7.701992 | 0.528597 | 53.200001 | 0.388928 | 0.079864 | 0.880638 | 0.553553 | 0.339276 |
| 8 | Afghanistan | South Asia | 2016 | 4.220169 | 7.696560 | 0.559072 | 53.000000 | 0.522566 | 0.042265 | 0.793246 | 0.564953 | 0.348332 |
| 9 | Afghanistan | South Asia | 2017 | 2.661718 | 7.697381 | 0.490880 | 52.799999 | 0.427011 | -0.121303 | 0.954393 | 0.496349 | 0.371326 |
| 10 | Afghanistan | South Asia | 2018 | 2.694303 | 7.691767 | 0.507516 | 52.599998 | 0.373536 | -0.093828 | 0.927606 | 0.424125 | 0.404904 |
| 11 | Afghanistan | South Asia | 2019 | 2.375092 | 7.697248 | 0.419973 | 52.400002 | 0.393656 | -0.108459 | 0.923849 | 0.351387 | 0.502474 |
| 12 | Afghanistan | South Asia | 2021 | 2.522900 | 7.694710 | 0.462596 | 52.492615 | 0.381749 | -0.101684 | 0.924338 | NaN | NaN |
df[ df['Country name'] == 'Austria' ]
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 86 | Austria | Western Europe | 2006 | 7.122211 | 10.841940 | 0.936350 | 70.760002 | 0.941382 | 0.302386 | 0.490111 | 0.823105 | 0.173812 |
| 87 | Austria | Western Europe | 2008 | 7.180954 | 10.886662 | 0.934593 | 71.080002 | 0.879069 | 0.291309 | 0.613625 | 0.832170 | 0.173195 |
| 88 | Austria | Western Europe | 2010 | 7.302679 | 10.861471 | 0.914193 | 71.400002 | 0.895980 | 0.130891 | 0.546145 | 0.814719 | 0.155793 |
| 89 | Austria | Western Europe | 2011 | 7.470513 | 10.886909 | 0.944157 | 71.540001 | 0.939356 | 0.131578 | 0.702721 | 0.789471 | 0.145238 |
| 90 | Austria | Western Europe | 2012 | 7.400689 | 10.889132 | 0.945142 | 71.680000 | 0.919704 | 0.117804 | 0.770586 | 0.822248 | 0.156675 |
| 91 | Austria | Western Europe | 2013 | 7.498803 | 10.883492 | 0.949809 | 71.820000 | 0.921734 | 0.168248 | 0.678937 | 0.787313 | 0.162603 |
| 92 | Austria | Western Europe | 2014 | 6.950000 | 10.882268 | 0.898920 | 71.959999 | 0.885027 | 0.117607 | 0.566931 | 0.779693 | 0.170150 |
| 93 | Austria | Western Europe | 2015 | 7.076447 | 10.881152 | 0.928110 | 72.099998 | 0.900305 | 0.098893 | 0.557480 | 0.798263 | 0.164469 |
| 94 | Austria | Western Europe | 2016 | 7.048072 | 10.890950 | 0.926319 | 72.400002 | 0.888514 | 0.079749 | 0.523641 | 0.755903 | 0.197424 |
| 95 | Austria | Western Europe | 2017 | 7.293728 | 10.908466 | 0.906218 | 72.699997 | 0.890031 | 0.133064 | 0.518304 | 0.747569 | 0.180269 |
| 96 | Austria | Western Europe | 2018 | 7.396002 | 10.927505 | 0.911668 | 73.000000 | 0.904112 | 0.053470 | 0.523061 | 0.752350 | 0.226059 |
| 97 | Austria | Western Europe | 2019 | 7.195361 | 10.939381 | 0.964489 | 73.300003 | 0.903428 | 0.059686 | 0.457089 | 0.774459 | 0.205170 |
| 98 | Austria | Western Europe | 2020 | 7.213489 | 10.851118 | 0.924831 | 73.599998 | 0.911910 | 0.011032 | 0.463830 | 0.769317 | 0.206500 |
| 99 | Austria | Western Europe | 2021 | 7.267800 | 10.906316 | 0.934176 | 73.299721 | 0.907691 | 0.041568 | 0.481378 | NaN | NaN |
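The region back-fill shown earlier can also be written without an explicit assignment per country. The following is a minimal sketch on a toy frame with made-up rows, not the notebook's data; the helper `single_region` and the name `region_by_country` are hypothetical and introduced only for illustration.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same two columns used by the fill loop
toy = pd.DataFrame({
    'Country name': ['Austria', 'Austria', 'Austria', 'Oman'],
    'Regional indicator': [None, None, 'Western Europe', None],
})

def single_region(values):
    """Return the only non-null region in `values`, or None if there is none or more than one."""
    found = values.dropna().unique()
    return found[0] if len(found) == 1 else None

# One candidate region per country, built from the rows that already have one
region_by_country = {
    country: single_region(group)
    for country, group in toy.groupby('Country name')['Regional indicator']
}

# Fill only the rows that are still empty
toy['Regional indicator'] = toy['Regional indicator'].fillna(
    toy['Country name'].map(region_by_country)
)
```

Countries with no region in any year (Oman in this toy frame) stay empty, which is why a second source is still needed for them.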
df['Regional indicator'].isna().sum()
63
sem_regiao = [*df[df['Regional indicator'].isna()]['Country name'].unique()]
sem_regiao
['Angola', 'Belize', 'Bhutan', 'Central African Republic', 'Congo (Kinshasa)', 'Cuba', 'Djibouti', 'Guyana', 'Oman', 'Qatar', 'Somalia', 'Somaliland region', 'South Sudan', 'Sudan', 'Suriname', 'Syria', 'Trinidad and Tobago']
# columns of the UN dataset
input_dfs['regions_un'].columns
Index(['Country or area ', 'Major area ', 'Region ', 'Development region'], dtype='object')
# countries in the UN dataset
input_dfs['regions_un']['Country or area '].unique()[:5]
array(['Afghanistan ', 'Albania ', 'Algeria ', 'American Samoa ',
'Andorra '], dtype=object)
# strip surrounding whitespace from the cell values
for c in input_dfs['regions_un']:
    input_dfs['regions_un'][c] = [ x.strip() for x in input_dfs['regions_un'][c] ]
# strip surrounding whitespace from the column names
input_dfs['regions_un'].columns = [ x.strip() for x in input_dfs['regions_un'].columns ]
input_dfs['regions_un']['Country or area'].unique()[:5]
array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
dtype=object)
df['Regional_indicator_consultado_Major'] = None
df['Regional_indicator_consultado'] = None
df = df[['Country name','Regional indicator','Regional_indicator_consultado_Major','Regional_indicator_consultado','year',
'Ladder score','Logged GDP per capita','Social support','Healthy life expectancy','Freedom to make life choices',
'Generosity','Perceptions of corruption','Positive affect','Negative affect']]
# from the UN dataset
input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == 'Brazil']
| | Country or area | Major area | Region | Development region |
|---|---|---|---|---|
| 27 | Brazil | Latin America and the Caribbean | South America | Less developed regions |
# from the provided data
df[df['Country name'] == 'Brazil'].head(1)
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234 | Brazil | Latin America and Caribbean | None | None | 2005 | 6.636771 | 9.438417 | 0.882923 | 63.299999 | 0.882186 | NaN | 0.744994 | 0.818337 | 0.30178 |
# fill the auxiliary columns with the UN data
nao_encontrados = []
for p in df['Country name'].unique():
    if p in input_dfs['regions_un']['Country or area'].unique():
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado_Major']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == p]['Major area'].values[0]
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == p]['Region'].values[0]
    else:
        # country name has no exact match in the UN dataset
        nao_encontrados.append(p)
# new format
df[df['Country name'] == 'Brazil'].head(1)
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234 | Brazil | Latin America and Caribbean | Latin America and the Caribbean | South America | 2005 | 6.636771 | 9.438417 | 0.882923 | 63.299999 | 0.882186 | NaN | 0.744994 | 0.818337 | 0.30178 |
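A left merge is another way to attach the UN columns and list the unmatched country names in one step. This is only a sketch under assumptions: the two frames below are toy stand-ins with made-up rows for `df` and `input_dfs['regions_un']`, and the names `happiness`, `regions_un`, `merged` and `not_found` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for df and input_dfs['regions_un']
happiness = pd.DataFrame({'Country name': ['Brazil', 'Oman', 'Russia']})
regions_un = pd.DataFrame({
    'Country or area': ['Brazil', 'Oman', 'Russian Federation'],
    'Major area': ['Latin America and the Caribbean', 'Asia', 'Europe'],
    'Region': ['South America', 'Western Asia', 'Eastern Europe'],
})

merged = happiness.merge(
    regions_un[['Country or area', 'Major area', 'Region']],
    left_on='Country name',
    right_on='Country or area',
    how='left',
    indicator=True,    # adds a '_merge' column describing where each row matched
    validate='m:1',    # fail loudly if the UN table repeats a country name
)

# Rows flagged 'left_only' found no exact name match in the UN table
not_found = merged.loc[merged['_merge'] == 'left_only', 'Country name'].unique().tolist()
```

In this toy example `not_found` contains only `'Russia'`, because the UN table spells it `'Russian Federation'`; it plays the same role as `nao_encontrados` in the loop.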
df[ df['Country name'] == 'Oman' ]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1414 | Oman | None | Asia | Western Asia | 2011 | 6.852982 | 10.382462 | NaN | 65.5 | 0.916293 | 0.024908 | NaN | NaN | 0.295164 |
df['Regional indicator'].value_counts()
Sub-Saharan Africa                    426
Latin America and Caribbean           299
Western Europe                        292
Central and Eastern Europe            242
Middle East and North Africa          228
Commonwealth of Independent States    182
Southeast Asia                        125
South Asia                             91
East Asia                              88
North America and ANZ                  62
Name: Regional indicator, dtype: int64
df['Regional_indicator_consultado'].value_counts()
Western Asia                 209
Southern Europe              173
Western Africa               162
Eastern Africa               148
Northern Europe              127
South America                127
Eastern Europe               114
Central America              109
South-Eastern Asia           100
Western Europe                99
Southern Asia                 94
Central Asia                  73
Northern Africa               58
Middle Africa                 50
Eastern Asia                  46
Southern Africa               45
Caribbean                     41
Australia and New Zealand     30
Northern America              16
Name: Regional_indicator_consultado, dtype: int64
df['Regional_indicator_consultado_Major'].value_counts()
Asia                               522
Europe                             513
Africa                             463
Latin America and the Caribbean    277
Oceania                             30
Northern America                    16
Name: Regional_indicator_consultado_Major, dtype: int64
for mr in df['Regional_indicator_consultado_Major'].unique():
    print(df[df['Regional_indicator_consultado_Major'] == mr][['Regional_indicator_consultado_Major','Regional indicator','Regional_indicator_consultado']].value_counts())
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Asia                                 Middle East and North Africa        Western Asia                   143
                                     Southeast Asia                      South-Eastern Asia             100
                                     South Asia                          Southern Asia                   91
                                     Commonwealth of Independent States  Central Asia                    73
                                                                         Western Asia                    46
                                     East Asia                           Eastern Asia                    46
                                     Western Europe                      Western Asia                    14
dtype: int64
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Europe                               Central and Eastern Europe          Southern Europe                 99
                                     Western Europe                      Western Europe                  99
                                     Central and Eastern Europe          Eastern Europe                  83
                                     Western Europe                      Northern Europe                 81
                                                                         Southern Europe                 74
                                     Central and Eastern Europe          Northern Europe                 46
                                     Commonwealth of Independent States  Eastern Europe                  31
dtype: int64
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Africa                               Sub-Saharan Africa                  Western Africa                 162
                                                                         Eastern Africa                 141
                                     Middle East and North Africa        Northern Africa                 49
                                     Sub-Saharan Africa                  Southern Africa                 45
                                                                         Middle Africa                   41
dtype: int64
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Latin America and the Caribbean      Latin America and Caribbean         South America                  125
                                                                         Central America                107
                                                                         Caribbean                       35
dtype: int64
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Oceania                              North America and ANZ               Australia and New Zealand       30
dtype: int64
Series([], dtype: int64)
Regional_indicator_consultado_Major  Regional indicator                  Regional_indicator_consultado
Northern America                     North America and ANZ               Northern America                16
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   Country name                         2098 non-null   object
 1   Regional indicator                   2035 non-null   object
 2   Regional_indicator_consultado_Major  1821 non-null   object
 3   Regional_indicator_consultado        1821 non-null   object
 4   year                                 2098 non-null   int64
 5   Ladder score                         2098 non-null   float64
 6   Logged GDP per capita                2062 non-null   float64
 7   Social support                       2085 non-null   float64
 8   Healthy life expectancy              2043 non-null   float64
 9   Freedom to make life choices         2066 non-null   float64
 10  Generosity                           2009 non-null   float64
 11  Perceptions of corruption            1988 non-null   float64
 12  Positive affect                      1927 non-null   float64
 13  Negative affect                      1933 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 310.4+ KB
input_dfs['regions_un']['Country or area'].unique()
array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina',
'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan',
'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan',
'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
'Botswana', 'Brazil', 'British Virgin Islands',
'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi',
'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands',
'Central African Republic', 'Chad', 'Channel Islands', 'Chile',
'China', 'Colombia', 'Comoros', 'Congo', 'Cook Islands',
'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Cyprus',
'Czech Republic', "Democratic People's Republic of Korea",
'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
'Ethiopia', 'Faeroe Islands', 'Falkland Islands (Malvinas)',
'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia',
'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar',
'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam',
'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti',
'Holy See', 'Honduras',
'China, Hong Kong Special Administrative Region', 'Hungary',
'Iceland', 'India', 'Indonesia', 'Iran (Islamic Republic of)',
'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica',
'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait',
'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia',
'Lebanon', 'Lesotho', 'Liberia', 'Libyan Arab Jamahiriya',
'Liechtenstein', 'Lithuania', 'Luxembourg',
'China, Macao Special Administrative Region', 'Madagascar',
'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius',
'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Monaco',
'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique',
'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands',
'Netherlands Antilles', 'New Caledonia', 'New Zealand',
'Nicaragua', 'Niger', 'Nigeria', 'Niue',
'Northern Mariana Islands', 'Norway',
'Occupied Palestinian Territory', 'Oman', 'Pakistan', 'Palau',
'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar',
'Republic of Korea', 'Republic of Moldova', 'Réunion', 'Romania',
'Russian Federation', 'Rwanda', 'Saint Helena',
'Saint Kitts and Nevis', 'Saint Lucia',
'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines',
'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia',
'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore',
'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia',
'South Africa', 'Spain', 'Sri Lanka', 'South Sudan', 'Sudan',
'Suriname', 'Swaziland', 'Sweden', 'Switzerland',
'Syrian Arab Republic', 'Tajikistan', 'Thailand',
'The former Yugoslav Republic of Macedonia', 'Timor-Leste', 'Togo',
'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda',
'Ukraine', 'United Arab Emirates',
'United Kingdom of Great Britain and Northern Ireland',
'United Republic of Tanzania', 'United States of America',
'United States Virgin Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu',
'Venezuela (Bolivarian Republic of)', 'Viet Nam',
'Wallis and Futuna Islands', 'Western Sahara', 'Yemen', 'Zambia',
'Zimbabwe', 'Czechoslovakia (former)',
'German Democratic Republic', 'Åland Islands', 'Norfolk Island',
'Saint-Barthélemy', 'Saint-Martin (French part)', 'Jersey',
'Svalbard and Jan Mayen Islands', 'USSR (former)',
'Yugoslavia (former)', 'Serbia and Montenegro (former)',
'Egypt and Sudan', 'Nordic countries', 'Other Africa',
'Bangladesh, India and Sri Lanka',
'Pacific Islands Trust Territories', 'Kosovo',
'USSR (former) - unknown', 'USSR (former) - European countries',
'USSR (former) - Asian countries', 'Democratic Yemen (former)',
'Other Latin America and the Caribbean', 'Other Northern America',
'Other Polynesia', 'Other Europe', 'European Union',
'Other Oceania', 'Other Northern Africa', 'Other Caribbean',
'Caribbean Commonwealth (West Indies)', 'Other Central America',
'Other South-Eastern Asia', 'Other South America', 'Other Asia',
'Taiwan, Province of China', 'Other Commonwealth',
'Other Micronesia', 'Other and unknown', 'Other Middle East',
'Other', 'Unknown', 'Stateless', 'African Commonwealth',
'Other Non-Commonwealth', 'Baltic states', 'Guernsey',
'European Union-15', 'European Union-8', 'Other European Union',
'Old Commonwealth', 'New Commonwealth', 'European Union-12',
'Australia and New Zealand', 'Asia Commonwealth',
'America Commonwealth', 'Oceania Commonwealth',
'Europe Commonwealth', 'Africa Commonwealth'], dtype=object)
# countries not found in the UN dataset
nao_encontrados
['Bolivia', 'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Hong Kong S.A.R. of China', 'Iran', 'Ivory Coast', 'Laos', 'Libya', 'Moldova', 'North Cyprus', 'North Macedonia', 'Palestinian Territories', 'Russia', 'Somaliland region', 'South Korea', 'Syria', 'Taiwan Province of China', 'Tanzania', 'United Kingdom', 'United States', 'Venezuela', 'Vietnam']
df[ (df['Country name'].isin(nao_encontrados)) & (df['Regional indicator'].isna()) ]['Country name'].unique()
#['Regional indicator'].unique()
array(['Congo (Kinshasa)', 'Somaliland region', 'Syria'], dtype=object)
# cross-reference of country names between the two datasets
obs = {'Bolivia':'Bolivia (Plurinational State of)', 'Congo (Brazzaville)':'Congo', 'Hong Kong S.A.R. of China':'China, Hong Kong Special Administrative Region',
'Iran':'Iran (Islamic Republic of)', 'Ivory Coast':"Côte d'Ivoire", 'Laos':"Lao People's Democratic Republic", 'Libya':'Libyan Arab Jamahiriya',
'Moldova':'Republic of Moldova', 'North Cyprus':'Cyprus', 'North Macedonia':'The former Yugoslav Republic of Macedonia', 'Palestinian Territories':'Occupied Palestinian Territory',
'Russia':'Russian Federation', 'South Korea':'Republic of Korea', 'Taiwan Province of China':"Taiwan, Province of China", 'Tanzania':'United Republic of Tanzania',
'United Kingdom':'United Kingdom of Great Britain and Northern Ireland', 'United States':'United States of America', 'Venezuela':'Venezuela (Bolivarian Republic of)',
'Vietnam':'Viet Nam','Congo (Kinshasa)':'Democratic Republic of the Congo', 'Somaliland region':'Somalia', 'Syria':'Syrian Arab Republic'}
# Somaliland is a self-declared state distinct from Somalia, but both fall in the same macro-regions
# fill the remaining missing values using the name cross-reference
for p in obs.keys():
    if obs[p] in input_dfs['regions_un']['Country or area'].unique():
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado_Major']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == obs[p]]['Major area'].values[0]
        df.loc[df[ df['Country name'] == p ].index,['Regional_indicator_consultado']] = input_dfs['regions_un'][input_dfs['regions_un']['Country or area'] == obs[p]]['Region'].values[0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   Country name                         2098 non-null   object
 1   Regional indicator                   2035 non-null   object
 2   Regional_indicator_consultado_Major  2098 non-null   object
 3   Regional_indicator_consultado        2098 non-null   object
 4   year                                 2098 non-null   int64
 5   Ladder score                         2098 non-null   float64
 6   Logged GDP per capita                2062 non-null   float64
 7   Social support                       2085 non-null   float64
 8   Healthy life expectancy              2043 non-null   float64
 9   Freedom to make life choices         2066 non-null   float64
 10  Generosity                           2009 non-null   float64
 11  Perceptions of corruption            1988 non-null   float64
 12  Positive affect                      1927 non-null   float64
 13  Negative affect                      1933 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 310.4+ KB
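The name cross-reference can also be applied as a single lookup: translate the happiness-report names to the UN spelling, then map them to a region. A minimal sketch on toy data follows; `name_map` is a two-entry subset of `obs`, and `happiness`, `regions_un`, `un_name` and `region_lookup` are hypothetical names used only here.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for df and input_dfs['regions_un']
happiness = pd.DataFrame({'Country name': ['Russia', 'Vietnam', 'Brazil']})
regions_un = pd.DataFrame({
    'Country or area': ['Russian Federation', 'Viet Nam', 'Brazil'],
    'Region': ['Eastern Europe', 'South-Eastern Asia', 'South America'],
})

# Two entries taken from the obs dictionary
name_map = {'Russia': 'Russian Federation', 'Vietnam': 'Viet Nam'}

# Names absent from the dictionary pass through unchanged
un_name = happiness['Country name'].replace(name_map)

# Series indexed by UN country name; assumes each name appears only once
region_lookup = regions_un.set_index('Country or area')['Region']
happiness['Regional_indicator_consultado'] = un_name.map(region_lookup)
```

Any name that is still unmatched after the translation comes back as NaN, so a quick `isna().sum()` on the new column shows whether the dictionary is complete.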
df.to_csv('df_com_zonas.csv')
country_names = [*df['Country name'].unique()]
cn_no_commons = list(set(country_names) - set(casos_singulares) - set(uma_amostra))
# alternative: for p in cn_no_commons:
for p in df['Country name'].unique():
    for col in df[df['Country name'] == p].drop(columns=['Country name','Regional indicator','Regional_indicator_consultado_Major','Regional_indicator_consultado']):
        nadf = df[ (df['Country name'] == p) & (df[col].isna()) ]
        # impute with the country mean only when at least two observed values remain
        if nadf.shape[0] > 0 and nadf.shape[0] <= df[ df['Country name'] == p ].shape[0]-2:
            df.loc[ nadf.index,col ] = df[df['Country name'] == p][col].mean()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   Country name                         2098 non-null   object
 1   Regional indicator                   2035 non-null   object
 2   Regional_indicator_consultado_Major  2098 non-null   object
 3   Regional_indicator_consultado        2098 non-null   object
 4   year                                 2098 non-null   int64
 5   Ladder score                         2098 non-null   float64
 6   Logged GDP per capita                2079 non-null   float64
 7   Social support                       2097 non-null   float64
 8   Healthy life expectancy              2062 non-null   float64
 9   Freedom to make life choices         2098 non-null   float64
 10  Generosity                           2079 non-null   float64
 11  Perceptions of corruption            2066 non-null   float64
 12  Positive affect                      2095 non-null   float64
 13  Negative affect                      2096 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 310.4+ KB
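The per-country mean imputation can be sketched with `groupby(...).transform`, which avoids the nested loops. The toy frame below uses made-up values and a single column; `country_mean`, `n_valid` and `fill_mask` are hypothetical names. The condition mirrors the loop's rule: a gap is filled only when the country has at least two observed values for that column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): country A has two observed values, B only one
toy = pd.DataFrame({
    'Country name': ['A', 'A', 'A', 'B', 'B'],
    'Generosity': [0.1, np.nan, 0.3, np.nan, 0.5],
})

col = 'Generosity'
grouped = toy.groupby('Country name')[col]
country_mean = grouped.transform('mean')   # per-country mean, aligned to the rows
n_valid = grouped.transform('count')       # per-country number of non-null values

# Impute only missing cells of countries with at least two observed values
fill_mask = toy[col].isna() & (n_valid >= 2)
toy.loc[fill_mask, col] = country_mean[fill_mask]
```

Country B keeps its gap because a mean taken from a single observation would just copy that value; this is why some columns still show fewer than 2098 non-null entries after the imputation.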
print(f"Singulares: {casos_singulares}\nUma linha: {uma_amostra}")
Singulares: ['Hong Kong S.A.R. of China', 'China', 'Turkmenistan', 'South Sudan', 'Somaliland region', 'Qatar', 'Kosovo', 'North Cyprus', 'Maldives', 'Somalia']
Uma linha: ['Cuba', 'Guyana', 'Oman', 'Suriname']
df[df['Country name'] == 'Hong Kong S.A.R. of China']
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 751 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2006 | 5.511187 | 10.746425 | 0.812178 | NaN | 0.909820 | 0.155567 | 0.355985 | 0.723260 | 0.235955 |
| 752 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2008 | 5.137262 | 10.815545 | 0.840222 | NaN | 0.922211 | 0.296268 | 0.273945 | 0.718972 | 0.236634 |
| 753 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2009 | 5.397056 | 10.788494 | 0.834716 | NaN | 0.918026 | 0.307638 | 0.272125 | 0.762151 | 0.210104 |
| 754 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2010 | 5.642835 | 10.846634 | 0.857314 | NaN | 0.890418 | 0.331955 | 0.255775 | 0.710370 | 0.183106 |
| 755 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2011 | 5.474011 | 10.886932 | 0.846060 | NaN | 0.894330 | 0.234555 | 0.244887 | 0.733887 | 0.195712 |
| 756 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2012 | 5.483765 | 10.892753 | 0.826426 | NaN | 0.879752 | 0.222402 | 0.379783 | 0.715137 | 0.183349 |
| 757 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2014 | 5.458051 | 10.939503 | 0.833558 | NaN | 0.843082 | 0.223799 | 0.422960 | 0.683968 | 0.242868 |
| 758 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2016 | 5.498421 | 10.969857 | 0.832078 | NaN | 0.799743 | 0.100235 | 0.402813 | 0.664093 | 0.213115 |
| 759 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2017 | 5.362475 | 10.999584 | 0.831066 | NaN | 0.830657 | 0.140063 | 0.415810 | 0.639533 | 0.200593 |
| 760 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2019 | 5.659317 | 11.000313 | 0.855826 | NaN | 0.726852 | 0.067344 | 0.431974 | 0.599320 | 0.357607 |
| 761 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2020 | 5.295341 | 10.898759 | 0.812943 | NaN | 0.705452 | 0.195197 | 0.380351 | 0.608647 | 0.210314 |
| 762 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2021 | 5.476700 | 11.000313 | 0.835781 | 76.820091 | 0.716808 | 0.067344 | 0.402650 | 0.687213 | 0.224487 |
df[df['Logged GDP per capita'].isna()]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 454 | Cuba | None | Latin America and the Caribbean | Caribbean | 2006 | 5.417869 | NaN | 0.969595 | 68.440002 | 0.281458 | NaN | NaN | 0.646712 | 0.276602 |
| 1381 | North Cyprus | Western Europe | Asia | Western Asia | 2012 | 5.463305 | NaN | 0.871150 | NaN | 0.692568 | NaN | 0.854730 | 0.709236 | 0.405435 |
| 1382 | North Cyprus | Western Europe | Asia | Western Asia | 2013 | 5.566803 | NaN | 0.869274 | NaN | 0.775383 | NaN | 0.715356 | 0.621554 | 0.442972 |
| 1383 | North Cyprus | Western Europe | Asia | Western Asia | 2014 | 5.785979 | NaN | 0.801802 | NaN | 0.829677 | NaN | 0.692221 | 0.723842 | 0.311336 |
| 1384 | North Cyprus | Western Europe | Asia | Western Asia | 2015 | 5.842550 | NaN | 0.791383 | NaN | 0.785353 | NaN | 0.659180 | 0.701609 | 0.318930 |
| 1385 | North Cyprus | Western Europe | Asia | Western Asia | 2016 | 5.827128 | NaN | 0.807690 | NaN | 0.796234 | NaN | 0.670191 | 0.643664 | 0.346465 |
| 1386 | North Cyprus | Western Europe | Asia | Western Asia | 2018 | 5.608056 | NaN | 0.837392 | NaN | 0.797066 | NaN | 0.613837 | 0.480453 | 0.261868 |
| 1387 | North Cyprus | Western Europe | Asia | Western Asia | 2019 | 5.466615 | NaN | 0.803295 | NaN | 0.792735 | NaN | 0.640059 | 0.493693 | 0.296411 |
| 1681 | Somalia | None | Africa | Eastern Africa | 2014 | 5.528273 | NaN | 0.610836 | 49.599998 | 0.873879 | NaN | 0.456470 | 0.834454 | 0.207215 |
| 1682 | Somalia | None | Africa | Eastern Africa | 2015 | 5.353645 | NaN | 0.599281 | 50.099998 | 0.967869 | NaN | 0.410236 | 0.900668 | 0.186736 |
| 1683 | Somalia | None | Africa | Eastern Africa | 2016 | 4.667941 | NaN | 0.594417 | 50.000000 | 0.917323 | NaN | 0.440802 | 0.891423 | 0.193282 |
| 1684 | Somaliland region | None | Africa | Eastern Africa | 2009 | 4.991400 | NaN | 0.879567 | NaN | 0.746304 | NaN | 0.513372 | 0.818879 | 0.112012 |
| 1685 | Somaliland region | None | Africa | Eastern Africa | 2010 | 4.657363 | NaN | 0.829005 | NaN | 0.820182 | NaN | 0.471094 | 0.769375 | 0.083426 |
| 1686 | Somaliland region | None | Africa | Eastern Africa | 2011 | 4.930572 | NaN | 0.787962 | NaN | 0.858104 | NaN | 0.357341 | 0.748686 | 0.122244 |
| 1687 | Somaliland region | None | Africa | Eastern Africa | 2012 | 5.057314 | NaN | 0.786291 | NaN | 0.758219 | NaN | 0.333832 | 0.735189 | 0.152428 |
| 1720 | South Sudan | None | Africa | Northern Africa | 2014 | 3.831992 | NaN | 0.545118 | 49.840000 | 0.567259 | NaN | 0.741541 | 0.614024 | 0.428320 |
| 1721 | South Sudan | None | Africa | Northern Africa | 2015 | 4.070771 | NaN | 0.584781 | 50.200001 | 0.511631 | NaN | 0.709606 | 0.586278 | 0.449795 |
| 1722 | South Sudan | None | Africa | Northern Africa | 2016 | 2.888112 | NaN | 0.532152 | 50.599998 | 0.439919 | NaN | 0.785318 | 0.614771 | 0.549257 |
| 1723 | South Sudan | None | Africa | Northern Africa | 2017 | 2.816622 | NaN | 0.556823 | 51.000000 | 0.456011 | NaN | 0.761270 | 0.585602 | 0.517364 |
df[df['Social support'].isna()]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1414 | Oman | None | Asia | Western Asia | 2011 | 6.852982 | 10.382462 | NaN | 65.5 | 0.916293 | 0.024908 | NaN | NaN | 0.295164 |
df[df['Healthy life expectancy'].isna()]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 751 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2006 | 5.511187 | 10.746425 | 0.812178 | NaN | 0.909820 | 0.155567 | 0.355985 | 0.723260 | 0.235955 |
| 752 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2008 | 5.137262 | 10.815545 | 0.840222 | NaN | 0.922211 | 0.296268 | 0.273945 | 0.718972 | 0.236634 |
| 753 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2009 | 5.397056 | 10.788494 | 0.834716 | NaN | 0.918026 | 0.307638 | 0.272125 | 0.762151 | 0.210104 |
| 754 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2010 | 5.642835 | 10.846634 | 0.857314 | NaN | 0.890418 | 0.331955 | 0.255775 | 0.710370 | 0.183106 |
| 755 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2011 | 5.474011 | 10.886932 | 0.846060 | NaN | 0.894330 | 0.234555 | 0.244887 | 0.733887 | 0.195712 |
| 756 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2012 | 5.483765 | 10.892753 | 0.826426 | NaN | 0.879752 | 0.222402 | 0.379783 | 0.715137 | 0.183349 |
| 757 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2014 | 5.458051 | 10.939503 | 0.833558 | NaN | 0.843082 | 0.223799 | 0.422960 | 0.683968 | 0.242868 |
| 758 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2016 | 5.498421 | 10.969857 | 0.832078 | NaN | 0.799743 | 0.100235 | 0.402813 | 0.664093 | 0.213115 |
| 759 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2017 | 5.362475 | 10.999584 | 0.831066 | NaN | 0.830657 | 0.140063 | 0.415810 | 0.639533 | 0.200593 |
| 760 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2019 | 5.659317 | 11.000313 | 0.855826 | NaN | 0.726852 | 0.067344 | 0.431974 | 0.599320 | 0.357607 |
| 761 | Hong Kong S.A.R. of China | East Asia | Asia | Eastern Asia | 2020 | 5.295341 | 10.898759 | 0.812943 | NaN | 0.705452 | 0.195197 | 0.380351 | 0.608647 | 0.210314 |
| 973 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2007 | 5.103906 | 8.927753 | 0.847812 | NaN | 0.381364 | 0.143901 | 0.894462 | 0.654866 | 0.236699 |
| 974 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2008 | 5.521660 | 8.980872 | 0.883843 | NaN | 0.664265 | 0.090464 | 0.849059 | 0.693481 | 0.317828 |
| 975 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2009 | 5.891433 | 9.008162 | 0.830427 | NaN | 0.506415 | 0.200504 | 0.967839 | 0.597583 | 0.168830 |
| 976 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2010 | 5.176601 | 9.032693 | 0.707959 | NaN | 0.451444 | 0.169696 | 0.967272 | 0.695178 | 0.117717 |
| 977 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2011 | 4.859502 | 9.066925 | 0.759102 | NaN | 0.588979 | 0.003699 | 0.919212 | 0.695966 | 0.124438 |
| 978 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2012 | 5.639588 | 9.085688 | 0.757147 | NaN | 0.635793 | 0.027182 | 0.949651 | 0.595572 | 0.099630 |
| 979 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2013 | 6.125758 | 9.113430 | 0.720750 | NaN | 0.568463 | 0.114904 | 0.935095 | 0.691511 | 0.202731 |
| 980 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2014 | 5.000375 | 9.128522 | 0.705632 | NaN | 0.441391 | 0.012095 | 0.775201 | 0.636128 | 0.205950 |
| 981 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2015 | 5.077461 | 9.182307 | 0.805271 | NaN | 0.561048 | 0.180851 | 0.850647 | 0.753090 | 0.179989 |
| 982 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2016 | 5.759412 | 9.228177 | 0.823803 | NaN | 0.827399 | 0.124869 | 0.940898 | 0.703887 | 0.149607 |
| 983 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2017 | 6.149200 | 9.262030 | 0.792087 | NaN | 0.857677 | 0.117175 | 0.925192 | 0.738436 | 0.185879 |
| 984 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2018 | 6.391826 | 9.296085 | 0.822407 | NaN | 0.889737 | 0.268795 | 0.922078 | 0.778271 | 0.170248 |
| 985 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2019 | 6.425144 | 9.338535 | 0.842511 | NaN | 0.841190 | 0.246990 | 0.920297 | 0.748522 | 0.140792 |
| 986 | Kosovo | Central and Eastern Europe | Europe | Southern Europe | 2020 | 6.294414 | 9.140673 | 0.792374 | NaN | 0.879838 | 0.139896 | 0.909894 | 0.726240 | 0.201458 |
| 1381 | North Cyprus | Western Europe | Asia | Western Asia | 2012 | 5.463305 | NaN | 0.871150 | NaN | 0.692568 | NaN | 0.854730 | 0.709236 | 0.405435 |
| 1382 | North Cyprus | Western Europe | Asia | Western Asia | 2013 | 5.566803 | NaN | 0.869274 | NaN | 0.775383 | NaN | 0.715356 | 0.621554 | 0.442972 |
| 1383 | North Cyprus | Western Europe | Asia | Western Asia | 2014 | 5.785979 | NaN | 0.801802 | NaN | 0.829677 | NaN | 0.692221 | 0.723842 | 0.311336 |
| 1384 | North Cyprus | Western Europe | Asia | Western Asia | 2015 | 5.842550 | NaN | 0.791383 | NaN | 0.785353 | NaN | 0.659180 | 0.701609 | 0.318930 |
| 1385 | North Cyprus | Western Europe | Asia | Western Asia | 2016 | 5.827128 | NaN | 0.807690 | NaN | 0.796234 | NaN | 0.670191 | 0.643664 | 0.346465 |
| 1386 | North Cyprus | Western Europe | Asia | Western Asia | 2018 | 5.608056 | NaN | 0.837392 | NaN | 0.797066 | NaN | 0.613837 | 0.480453 | 0.261868 |
| 1387 | North Cyprus | Western Europe | Asia | Western Asia | 2019 | 5.466615 | NaN | 0.803295 | NaN | 0.792735 | NaN | 0.640059 | 0.493693 | 0.296411 |
| 1684 | Somaliland region | None | Africa | Eastern Africa | 2009 | 4.991400 | NaN | 0.879567 | NaN | 0.746304 | NaN | 0.513372 | 0.818879 | 0.112012 |
| 1685 | Somaliland region | None | Africa | Eastern Africa | 2010 | 4.657363 | NaN | 0.829005 | NaN | 0.820182 | NaN | 0.471094 | 0.769375 | 0.083426 |
| 1686 | Somaliland region | None | Africa | Eastern Africa | 2011 | 4.930572 | NaN | 0.787962 | NaN | 0.858104 | NaN | 0.357341 | 0.748686 | 0.122244 |
| 1687 | Somaliland region | None | Africa | Eastern Africa | 2012 | 5.057314 | NaN | 0.786291 | NaN | 0.758219 | NaN | 0.333832 | 0.735189 | 0.152428 |
df[df['Generosity'].isna()]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 454 | Cuba | None | Latin America and the Caribbean | Caribbean | 2006 | 5.417869 | NaN | 0.969595 | 68.440002 | 0.281458 | NaN | NaN | 0.646712 | 0.276602 |
| 1381 | North Cyprus | Western Europe | Asia | Western Asia | 2012 | 5.463305 | NaN | 0.871150 | NaN | 0.692568 | NaN | 0.854730 | 0.709236 | 0.405435 |
| 1382 | North Cyprus | Western Europe | Asia | Western Asia | 2013 | 5.566803 | NaN | 0.869274 | NaN | 0.775383 | NaN | 0.715356 | 0.621554 | 0.442972 |
| 1383 | North Cyprus | Western Europe | Asia | Western Asia | 2014 | 5.785979 | NaN | 0.801802 | NaN | 0.829677 | NaN | 0.692221 | 0.723842 | 0.311336 |
| 1384 | North Cyprus | Western Europe | Asia | Western Asia | 2015 | 5.842550 | NaN | 0.791383 | NaN | 0.785353 | NaN | 0.659180 | 0.701609 | 0.318930 |
| 1385 | North Cyprus | Western Europe | Asia | Western Asia | 2016 | 5.827128 | NaN | 0.807690 | NaN | 0.796234 | NaN | 0.670191 | 0.643664 | 0.346465 |
| 1386 | North Cyprus | Western Europe | Asia | Western Asia | 2018 | 5.608056 | NaN | 0.837392 | NaN | 0.797066 | NaN | 0.613837 | 0.480453 | 0.261868 |
| 1387 | North Cyprus | Western Europe | Asia | Western Asia | 2019 | 5.466615 | NaN | 0.803295 | NaN | 0.792735 | NaN | 0.640059 | 0.493693 | 0.296411 |
| 1681 | Somalia | None | Africa | Eastern Africa | 2014 | 5.528273 | NaN | 0.610836 | 49.599998 | 0.873879 | NaN | 0.456470 | 0.834454 | 0.207215 |
| 1682 | Somalia | None | Africa | Eastern Africa | 2015 | 5.353645 | NaN | 0.599281 | 50.099998 | 0.967869 | NaN | 0.410236 | 0.900668 | 0.186736 |
| 1683 | Somalia | None | Africa | Eastern Africa | 2016 | 4.667941 | NaN | 0.594417 | 50.000000 | 0.917323 | NaN | 0.440802 | 0.891423 | 0.193282 |
| 1684 | Somaliland region | None | Africa | Eastern Africa | 2009 | 4.991400 | NaN | 0.879567 | NaN | 0.746304 | NaN | 0.513372 | 0.818879 | 0.112012 |
| 1685 | Somaliland region | None | Africa | Eastern Africa | 2010 | 4.657363 | NaN | 0.829005 | NaN | 0.820182 | NaN | 0.471094 | 0.769375 | 0.083426 |
| 1686 | Somaliland region | None | Africa | Eastern Africa | 2011 | 4.930572 | NaN | 0.787962 | NaN | 0.858104 | NaN | 0.357341 | 0.748686 | 0.122244 |
| 1687 | Somaliland region | None | Africa | Eastern Africa | 2012 | 5.057314 | NaN | 0.786291 | NaN | 0.758219 | NaN | 0.333832 | 0.735189 | 0.152428 |
| 1720 | South Sudan | None | Africa | Northern Africa | 2014 | 3.831992 | NaN | 0.545118 | 49.840000 | 0.567259 | NaN | 0.741541 | 0.614024 | 0.428320 |
| 1721 | South Sudan | None | Africa | Northern Africa | 2015 | 4.070771 | NaN | 0.584781 | 50.200001 | 0.511631 | NaN | 0.709606 | 0.586278 | 0.449795 |
| 1722 | South Sudan | None | Africa | Northern Africa | 2016 | 2.888112 | NaN | 0.532152 | 50.599998 | 0.439919 | NaN | 0.785318 | 0.614771 | 0.549257 |
| 1723 | South Sudan | None | Africa | Northern Africa | 2017 | 2.816622 | NaN | 0.556823 | 51.000000 | 0.456011 | NaN | 0.761270 | 0.585602 | 0.517364 |
df[df['Perceptions of corruption'].isna()]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 367 | China | East Asia | Asia | Eastern Asia | 2006 | 4.560495 | 8.696120 | 0.747011 | 66.879997 | 0.851083 | -0.169039 | NaN | 0.809295 | 0.169580 |
| 368 | China | East Asia | Asia | Eastern Asia | 2007 | 4.862862 | 8.823954 | 0.810852 | 67.059998 | 0.851083 | -0.176243 | NaN | 0.817485 | 0.158614 |
| 369 | China | East Asia | Asia | Eastern Asia | 2008 | 4.846295 | 8.910992 | 0.748287 | 67.239998 | 0.853072 | -0.092472 | NaN | 0.817443 | 0.146963 |
| 370 | China | East Asia | Asia | Eastern Asia | 2009 | 4.454361 | 8.995857 | 0.798034 | 67.419998 | 0.771143 | -0.160481 | NaN | 0.785806 | 0.161650 |
| 371 | China | East Asia | Asia | Eastern Asia | 2010 | 4.652737 | 9.092104 | 0.767753 | 67.599998 | 0.804794 | -0.133318 | NaN | 0.765265 | 0.158100 |
| 372 | China | East Asia | Asia | Eastern Asia | 2011 | 5.037208 | 9.178532 | 0.787171 | 67.760002 | 0.824162 | -0.186383 | NaN | 0.820074 | 0.133503 |
| 373 | China | East Asia | Asia | Eastern Asia | 2012 | 5.094917 | 9.249320 | 0.787818 | 67.919998 | 0.808255 | -0.184676 | NaN | 0.820785 | 0.158703 |
| 374 | China | East Asia | Asia | Eastern Asia | 2013 | 5.241090 | 9.319200 | 0.777896 | 68.080002 | 0.804724 | -0.157777 | NaN | 0.836431 | 0.142211 |
| 375 | China | East Asia | Asia | Eastern Asia | 2014 | 5.195619 | 9.385755 | 0.820366 | 68.239998 | 0.851083 | -0.216772 | NaN | 0.853975 | 0.111518 |
| 376 | China | East Asia | Asia | Eastern Asia | 2015 | 5.303878 | 9.448723 | 0.793734 | 68.400002 | 0.851083 | -0.244435 | NaN | 0.808911 | 0.171315 |
| 377 | China | East Asia | Asia | Eastern Asia | 2016 | 5.324956 | 9.509552 | 0.741703 | 68.699997 | 0.851083 | -0.227522 | NaN | 0.826144 | 0.145625 |
| 378 | China | East Asia | Asia | Eastern Asia | 2017 | 5.099061 | 9.571116 | 0.772033 | 69.000000 | 0.877618 | -0.174832 | NaN | 0.821097 | 0.214005 |
| 379 | China | East Asia | Asia | Eastern Asia | 2018 | 5.131434 | 9.631892 | 0.787605 | 69.300003 | 0.895378 | -0.158510 | NaN | 0.855784 | 0.189640 |
| 380 | China | East Asia | Asia | Eastern Asia | 2019 | 5.144120 | 9.687612 | 0.821936 | 69.599998 | 0.927356 | -0.173036 | NaN | 0.890780 | 0.146512 |
| 381 | China | East Asia | Asia | Eastern Asia | 2020 | 5.771065 | 9.701755 | 0.808334 | 69.900002 | 0.891123 | -0.103214 | NaN | 0.789345 | 0.244918 |
| 454 | Cuba | None | Latin America and the Caribbean | Caribbean | 2006 | 5.417869 | NaN | 0.969595 | 68.440002 | 0.281458 | NaN | NaN | 0.646712 | 0.276602 |
| 1144 | Maldives | South Asia | Asia | Southern Asia | 2018 | 5.197575 | 9.825986 | 0.913315 | 70.599998 | 0.854759 | 0.023998 | NaN | NaN | NaN |
| 1414 | Oman | None | Asia | Western Asia | 2011 | 6.852982 | 10.382462 | NaN | 65.500000 | 0.916293 | 0.024908 | NaN | NaN | 0.295164 |
| 1535 | Qatar | None | Asia | Western Asia | 2010 | 6.849653 | 11.519814 | 0.863325 | 66.699997 | 0.898004 | 0.103687 | NaN | 0.734913 | 0.302685 |
| 1536 | Qatar | None | Asia | Western Asia | 2011 | 6.591604 | 11.553021 | 0.857351 | 67.019997 | 0.904687 | 0.011700 | NaN | 0.760927 | 0.327790 |
| 1537 | Qatar | None | Asia | Western Asia | 2012 | 6.611299 | 11.523082 | 0.838132 | 67.339996 | 0.924334 | 0.161530 | NaN | 0.765899 | 0.322181 |
| 1538 | Qatar | None | Asia | Western Asia | 2015 | 6.374529 | 11.485615 | 0.863325 | 68.300003 | 0.898004 | 0.127954 | NaN | 0.734913 | 0.302685 |
| 1904 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2009 | 6.567713 | 8.989171 | 0.923846 | 59.439999 | 0.787891 | -0.101684 | NaN | 0.780770 | 0.151584 |
| 1905 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2011 | 5.791755 | 9.181697 | 0.964419 | 60.040001 | 0.787891 | 0.018397 | NaN | 0.639033 | 0.122068 |
| 1906 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2012 | 5.463827 | 9.268988 | 0.945841 | 60.279999 | 0.785563 | -0.122812 | NaN | 0.584448 | 0.116881 |
| 1907 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2013 | 5.391763 | 9.347593 | 0.845733 | 60.520000 | 0.704529 | -0.071448 | NaN | 0.598716 | 0.159606 |
| 1908 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2014 | 5.787379 | 9.427173 | 0.908927 | 60.759998 | 0.804678 | 0.031971 | NaN | 0.695216 | 0.153950 |
| 1909 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2015 | 5.791460 | 9.472206 | 0.960158 | 61.000000 | 0.701358 | 0.092775 | NaN | 0.705348 | 0.301039 |
| 1910 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2016 | 5.887052 | 9.515066 | 0.929032 | 61.400002 | 0.748504 | 0.004624 | NaN | 0.636389 | 0.255499 |
| 1911 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2017 | 5.229149 | 9.561351 | 0.908455 | 61.799999 | 0.720399 | 0.066041 | NaN | 0.520885 | 0.349628 |
| 1912 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2018 | 4.620602 | 9.605440 | 0.984489 | 62.200001 | 0.857774 | 0.259659 | NaN | 0.612210 | 0.189025 |
| 1913 | Turkmenistan | Commonwealth of Independent States | Asia | Central Asia | 2019 | 5.474300 | 9.651184 | 0.981502 | 62.599998 | 0.891527 | 0.284881 | NaN | 0.509915 | 0.183343 |
df[df['Positive affect'].isna()]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1144 | Maldives | South Asia | Asia | Southern Asia | 2018 | 5.197575 | 9.825986 | 0.913315 | 70.599998 | 0.854759 | 0.023998 | NaN | NaN | NaN |
| 1145 | Maldives | South Asia | Asia | Southern Asia | 2021 | 5.197600 | 9.825986 | 0.913161 | 70.599998 | 0.853963 | 0.023998 | 0.82465 | NaN | NaN |
| 1414 | Oman | None | Asia | Western Asia | 2011 | 6.852982 | 10.382462 | NaN | 65.500000 | 0.916293 | 0.024908 | NaN | NaN | 0.295164 |
df[df['Negative affect'].isna()]
| | Country name | Regional indicator | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1144 | Maldives | South Asia | Asia | Southern Asia | 2018 | 5.197575 | 9.825986 | 0.913315 | 70.599998 | 0.854759 | 0.023998 | NaN | NaN | NaN |
| 1145 | Maldives | South Asia | Asia | Southern Asia | 2021 | 5.197600 | 9.825986 | 0.913161 | 70.599998 | 0.853963 | 0.023998 | 0.82465 | NaN | NaN |
df_corte = df[ (~df['Country name'].isin(casos_singulares)) & (~df['Country name'].isin(uma_amostra)) ]
df_corte.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2014 entries, 0 to 2097
Data columns (total 14 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   Country name                         2014 non-null   object
 1   Regional indicator                   1971 non-null   object
 2   Regional_indicator_consultado_Major  2014 non-null   object
 3   Regional_indicator_consultado        2014 non-null   object
 4   year                                 2014 non-null   int64
 5   Ladder score                         2014 non-null   float64
 6   Logged GDP per capita                2014 non-null   float64
 7   Social support                       2014 non-null   float64
 8   Healthy life expectancy              2014 non-null   float64
 9   Freedom to make life choices         2014 non-null   float64
 10  Generosity                           2014 non-null   float64
 11  Perceptions of corruption            2014 non-null   float64
 12  Positive affect                      2014 non-null   float64
 13  Negative affect                      2014 non-null   float64
dtypes: float64(9), int64(1), object(4)
memory usage: 236.0+ KB
df_corte.to_csv('df_preenchido_descarte.csv')
ao_ano_corte = df_corte.groupby(by='year',as_index=False).agg('mean')
corte_scaler = MinMaxScaler()
sc_ano_corte = pd.DataFrame(corte_scaler.fit_transform(ao_ano_corte.drop(columns='year')),columns=ao_ano_corte.drop(columns='year').columns)
sc_ano_corte['year'] = ao_ano_corte['year']
sc_ano_corte = sc_ano_corte[[*ao_ano_corte.columns]]
print(f"About the scaler fitted after discarding data:\nFeatures: {corte_scaler.feature_names_in_}\nMaximum values: {corte_scaler.data_max_}\nMinimum values: {corte_scaler.data_min_}\nFeature range: {corte_scaler.feature_range}\nGeneral parameters: {corte_scaler.get_params()}")
About the scaler fitted after discarding data:
Features: ['Ladder score' 'Logged GDP per capita' 'Social support'
'Healthy life expectancy' 'Freedom to make life choices' 'Generosity'
'Perceptions of corruption' 'Positive affect' 'Negative affect']
Maximum values: [6.44616427e+00 1.01186379e+01 8.97366864e-01 6.70925527e+01
8.22788169e-01 1.99202307e-02 7.88631714e-01 7.43805360e-01
2.95408072e-01]
Minimum values: [ 5.19811267e+00 9.02855413e+00 7.83287293e-01 5.99727907e+01
6.84637176e-01 -2.76569087e-02 7.03033391e-01 7.00210670e-01
2.43803148e-01]
Feature range: (0, 1)
General parameters: {'clip': False, 'copy': True, 'feature_range': (0, 1)}
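The minimum and maximum values printed above are all that min-max scaling needs: each column is mapped to the (0, 1) range with `(x - min) / (max - min)`. A minimal sketch of that arithmetic on a toy frame follows; the numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy yearly means; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "Ladder score": [5.2, 5.6, 6.0],
    "Generosity": [-0.02, 0.00, 0.02],
})

features = toy.drop(columns="year")
col_min = features.min()
col_max = features.max()

# Column-wise min-max scaling to the (0, 1) range.
scaled = (features - col_min) / (col_max - col_min)
scaled.insert(0, "year", toy["year"])
print(scaled)
```

Because every column ends up between 0 and 1, indicators with very different units (life expectancy in years, generosity as a residual) can share one plot, at the cost of hiding their absolute magnitudes.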
fig = go.Figure()
cols = [*sc_ano_corte.columns]
cols.remove('year')
for column in cols:
fig.add_trace(go.Scatter( x = sc_ano_corte.year, y = sc_ano_corte[column], name = column, mode = 'lines') )
fig.update_layout(title = "Global Scaled Indicators by Year [after discarding data]", xaxis_title = 'Year')
fig.show()
fig.write_html('indicadores_global_norm_descarte.html')
All indicators were affected by the pandemic, although Perceptions of corruption and Generosity behave inversely to the rest.
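One way to put a number on a visual impression like this is to compare the 2020 year-over-year change of each indicator with its typical change in earlier years. The sketch below does that on a toy table; the values are invented for illustration and the ratio is only one possible yardstick, not the method used in this notebook.

```python
import pandas as pd

# Toy yearly global means; illustrative numbers only.
yearly = pd.DataFrame({
    "year": [2016, 2017, 2018, 2019, 2020],
    "Ladder score": [5.40, 5.42, 5.44, 5.46, 5.70],
    "Generosity": [0.010, 0.008, 0.006, 0.004, -0.020],
}).set_index("year")

# Year-over-year change of each indicator.
delta = yearly.diff()

# Typical absolute change up to 2019 versus the change observed in 2020.
baseline = delta.loc[:2019].abs().mean()
ratio_2020 = delta.loc[2020].abs() / baseline
print(ratio_2020)
```

A ratio well above 1 means the 2020 movement was large compared with previous years; the sign of `delta.loc[2020]` tells whether the indicator rose or fell.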
br_ano_corte = df_corte[df_corte['Country name']=='Brazil'].groupby(by='year',as_index=False).agg('mean')
cols = [*br_ano_corte.columns]
cols.remove('year')
fig = go.Figure()
for column in cols:
fig.add_trace(go.Scatter( x = br_ano_corte.year, y = br_ano_corte[column], name = column, mode = 'lines') )
fig.update_layout(title = "Indicators by Year [Brazil, non-normalized, post-discard]", xaxis_title = 'Year')
fig.show()
fig.write_html('indicadores_br_raw_descarte.html')
df1 = pd.read_csv('df_preenchido_descarte.csv', index_col=0) # dataset after discarding data
df1.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)
aux_columns = df1[['Regional_indicator_consultado_Major','Regional_indicator_consultado']]
df1.drop(columns=['Regional_indicator_consultado_Major','Regional_indicator_consultado'], inplace=True) # drop the auxiliary columns
df1
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | 0.258195 |
| 1 | Afghanistan | South Asia | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | 0.237092 |
| 2 | Afghanistan | South Asia | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | 0.275324 |
| 3 | Afghanistan | South Asia | 2011 | 3.831719 | 7.619532 | 0.521104 | 51.919998 | 0.495901 | 0.162427 | 0.731109 | 0.611387 | 0.267175 |
| 4 | Afghanistan | South Asia | 2012 | 3.782938 | 7.705479 | 0.520637 | 52.240002 | 0.530935 | 0.236032 | 0.775620 | 0.710385 | 0.267919 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2009 | Zimbabwe | Sub-Saharan Africa | 2017 | 3.638300 | 8.015738 | 0.754147 | 55.000000 | 0.752826 | -0.097645 | 0.751208 | 0.806428 | 0.224051 |
| 2010 | Zimbabwe | Sub-Saharan Africa | 2018 | 3.616480 | 8.048798 | 0.775388 | 55.599998 | 0.762675 | -0.068427 | 0.844209 | 0.710119 | 0.211726 |
| 2011 | Zimbabwe | Sub-Saharan Africa | 2019 | 2.693523 | 7.950132 | 0.759162 | 56.200001 | 0.631908 | -0.063791 | 0.830652 | 0.716004 | 0.235354 |
| 2012 | Zimbabwe | Sub-Saharan Africa | 2020 | 3.159802 | 7.828757 | 0.717243 | 56.799999 | 0.643303 | -0.008696 | 0.788523 | 0.702573 | 0.345736 |
| 2013 | Zimbabwe | Sub-Saharan Africa | 2021 | 3.144800 | 7.942595 | 0.750470 | 56.200840 | 0.676700 | -0.047346 | 0.820999 | 0.717712 | 0.224420 |
2014 rows × 12 columns
df1['Regional indicator'].value_counts()
Sub-Saharan Africa                    426
Latin America and Caribbean           299
Western Europe                        284
Middle East and North Africa          228
Central and Eastern Europe            227
Commonwealth of Independent States    171
Southeast Asia                        125
South Asia                             89
North America and ANZ                  62
East Asia                              60
Name: Regional indicator, dtype: int64
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2014 entries, 0 to 2013
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Country name                  2014 non-null   object
 1   Regional indicator            1971 non-null   object
 2   year                          2014 non-null   int64
 3   Ladder score                  2014 non-null   float64
 4   Logged GDP per capita         2014 non-null   float64
 5   Social support                2014 non-null   float64
 6   Healthy life expectancy       2014 non-null   float64
 7   Freedom to make life choices  2014 non-null   float64
 8   Generosity                    2014 non-null   float64
 9   Perceptions of corruption     2014 non-null   float64
 10  Positive affect               2014 non-null   float64
 11  Negative affect               2014 non-null   float64
dtypes: float64(9), int64(1), object(2)
memory usage: 188.9+ KB
colunas_originais = [*df1.columns]
sem_tgt_paises = [*df1.columns]
sem_tgt_paises.remove('Country name')
sem_tgt_paises.remove('Regional indicator') # original dataset columns without the target and without countries
countries = df1['Country name']
Descartados_profile = pr(df1, title="Profile Report com dados descartados", explorative=True, progress_bar=False)
Descartados_profile.to_file(f"profile_com_descartados.html")
df2 = df1.copy()
df1 = pd.get_dummies(df1, columns=['Country name'], prefix='', prefix_sep='', sparse=False, dtype=bool)
df_target = df1[df1['Regional indicator'].isna()].copy() # unlabeled rows to predict (random forest); copied so columns can be added later without a SettingWithCopyWarning
df1 = df1[~df1['Regional indicator'].isna()] # labeled rows (random forest)
df_target2 = df2[df2['Regional indicator'].isna()] # unlabeled rows to predict (CatBoost)
df2 = df2[~df2['Regional indicator'].isna()] # labeled rows (CatBoost)
target_countries = countries[countries.index.isin(df_target.index)]
df2.head(3)
| | Country name | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | 0.258195 |
| 1 | Afghanistan | South Asia | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | 0.237092 |
| 2 | Afghanistan | South Asia | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | 0.275324 |
df1.head(3)
| | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | ... | United Arab Emirates | United Kingdom | United States | Uruguay | Uzbekistan | Venezuela | Vietnam | Yemen | Zambia | Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | South Asia | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | South Asia | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | South Asia | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | ... | False | False | False | False | False | False | False | False | False | False |
3 rows × 163 columns
df_target.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 36 to 1801
Columns: 163 entries, Regional indicator to Zimbabwe
dtypes: bool(152), float64(9), int64(1), object(1)
memory usage: 10.4+ KB
df_target2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 36 to 1801
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Country name                  43 non-null     object
 1   Regional indicator            0 non-null      object
 2   year                          43 non-null     int64
 3   Ladder score                  43 non-null     float64
 4   Logged GDP per capita         43 non-null     float64
 5   Social support                43 non-null     float64
 6   Healthy life expectancy       43 non-null     float64
 7   Freedom to make life choices  43 non-null     float64
 8   Generosity                    43 non-null     float64
 9   Perceptions of corruption     43 non-null     float64
 10  Positive affect               43 non-null     float64
 11  Negative affect               43 non-null     float64
dtypes: float64(9), int64(1), object(2)
memory usage: 4.4+ KB
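The preparation above does two things: it one-hot encodes `Country name` and then separates rows with a known region (training data) from rows whose region is missing (prediction targets). A minimal sketch of the same pattern on a toy frame, with made-up countries `A`, `B`, `C` and regions `R1`, `R2`:

```python
import pandas as pd

# Toy data: country B has no region label (illustrative values).
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "C"],
    "Regional indicator": ["R1", "R1", None, "R2"],
    "Ladder score": [5.0, 5.1, 4.2, 6.3],
})

# One-hot encode the country; each country becomes a boolean column.
encoded = pd.get_dummies(toy, columns=["Country name"], prefix="", prefix_sep="", dtype=bool)

# Rows with a known region are used for training; the rest are prediction targets.
has_region = encoded["Regional indicator"].notna()
train = encoded[has_region]
target = encoded[~has_region]
print(train.shape, target.shape)
```

Note that in this toy example the column `B` is `False` in every training row, since all rows of an unlabeled country fall on the target side. A model trained this way cannot learn anything from that country's dummy column, so its predictions for such a country must come from the numeric indicators.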
y = df1['Regional indicator']
y2 = df2['Regional indicator']
sm = SMOTE(random_state=42, n_jobs=-1)
X_res, y_res = sm.fit_resample(df1.drop(columns=['Regional indicator']), df1['Regional indicator'])
print(f"before: {df1.shape[0]}, after: {X_res.shape[0]}")
before: 1971, after: 4260
y_res.value_counts()
South Asia                            426
Central and Eastern Europe            426
Middle East and North Africa          426
Latin America and Caribbean           426
Commonwealth of Independent States    426
North America and ANZ                 426
Western Europe                        426
Sub-Saharan Africa                    426
Southeast Asia                        426
East Asia                             426
Name: Regional indicator, dtype: int64
smnc = SMOTENC(random_state=42, n_jobs=-1, categorical_features=[0])
X_res2, y_res2 = smnc.fit_resample(df2.drop(columns=['Regional indicator']), df2['Regional indicator'])
print(f"before: {df2.shape[0]}, after: {X_res2.shape[0]}")
before: 1971, after: 4260
y_res2.value_counts()
South Asia                            426
Central and Eastern Europe            426
Middle East and North Africa          426
Latin America and Caribbean           426
Commonwealth of Independent States    426
North America and ANZ                 426
Western Europe                        426
Sub-Saharan Africa                    426
Southeast Asia                        426
East Asia                             426
Name: Regional indicator, dtype: int64
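Both resamplers bring every class up to the size of the largest one (426 rows) by creating synthetic samples. The sketch below illustrates the interpolation idea behind SMOTE with plain NumPy: pick a minority sample, pick one of its nearest neighbours, and place a new point somewhere on the segment between them. It is a simplified illustration, not imblearn's implementation; `synth_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples in 2-D (illustrative values).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def synth_sample(points, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([synth_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Interpolation only makes sense for numeric columns, which is why the frame that still holds `Country name` as text is resampled with `SMOTENC` and `categorical_features=[0]` rather than with plain `SMOTE`.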
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.15, random_state=42)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)
RandomForestClassifier(max_depth=2, random_state=0)
clf.score(X_test,y_test)
0.7449139280125195
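Because the oversampling above runs before `train_test_split`, synthetic rows interpolated from a test row and its neighbours can end up in the training set, which may make this score optimistic. A common alternative is to hold out the test set first and balance only the training part. The sketch below shows that order of operations on toy data; random oversampling with replacement stands in for SMOTE here so the example needs only pandas.

```python
import pandas as pd

# Toy imbalanced dataset (illustrative).
data = pd.DataFrame({
    "x": range(12),
    "region": ["R1"] * 9 + ["R2"] * 3,
})

# 1) Hold out a stratified test set first, so no resampled row can leak into it.
test = data.groupby("region").sample(frac=1/3, random_state=0)
train = data.drop(test.index)

# 2) Balance only the training part. Random oversampling with replacement
#    stands in for SMOTE to keep the sketch dependency-free.
n_max = int(train["region"].value_counts().max())
balanced = train.groupby("region").sample(n=n_max, replace=True, random_state=0)

print(balanced["region"].value_counts())
```

With this ordering the test set keeps the original class proportions, so its metrics describe the data the model would actually see.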
y_tgt = clf.predict( df_target.drop(columns=['Regional indicator']) )
df_target['Country name'] = target_countries
df_target['região estimada'] = y_tgt
df_target = df_target[['Country name','região estimada','Regional indicator', 'year', 'Ladder score',
'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
'Freedom to make life choices', 'Generosity', 'Perceptions of corruption',
'Positive affect', 'Negative affect']]
df_target.head(5)
| | Country name | região estimada | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36 | Angola | Sub-Saharan Africa | NaN | 2011 | 5.589001 | 8.945782 | 0.723094 | 52.500000 | 0.583702 | 0.055257 | 0.911320 | 0.658647 | 0.361063 |
| 37 | Angola | Sub-Saharan Africa | NaN | 2012 | 4.360250 | 8.991773 | 0.752593 | 53.200001 | 0.456029 | -0.136070 | 0.906300 | 0.557908 | 0.304890 |
| 38 | Angola | Sub-Saharan Africa | NaN | 2013 | 3.937107 | 9.004611 | 0.721591 | 53.900002 | 0.409555 | -0.103557 | 0.816375 | 0.658284 | 0.370875 |
| 39 | Angola | Sub-Saharan Africa | NaN | 2014 | 3.794838 | 9.016735 | 0.754615 | 54.599998 | 0.374542 | -0.167723 | 0.834076 | 0.578517 | 0.367864 |
| 173 | Belize | Latin America and Caribbean | NaN | 2007 | 6.450644 | 8.892479 | 0.872267 | 61.599998 | 0.705306 | 0.032754 | 0.768984 | 0.758783 | 0.250596 |
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
print(f"ROC AUC: {round(roc_auc_score(y_test, clf.predict_proba(X_test), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test, clf.predict(X_test), average='weighted'),4)}")
print(classification_report(y_test, clf.predict(X_test)))#, target_names=labels))
ROC AUC: 0.9496
Recall: 0.7449
precision recall f1-score support
Central and Eastern Europe 0.91 0.68 0.78 63
Commonwealth of Independent States 0.95 0.78 0.86 78
East Asia 0.86 1.00 0.93 64
Latin America and Caribbean 0.82 0.78 0.80 60
Middle East and North Africa 0.72 0.58 0.64 59
North America and ANZ 0.57 1.00 0.73 66
South Asia 0.68 0.74 0.71 61
Southeast Asia 0.81 0.84 0.83 57
Sub-Saharan Africa 0.72 0.81 0.76 67
Western Europe 0.40 0.22 0.28 64
accuracy 0.74 639
macro avg 0.75 0.74 0.73 639
weighted avg 0.75 0.74 0.73 639
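The weighted recall printed above (0.7449) equals the value returned by `clf.score`, and that is expected: when per-class recall is averaged with class supports as weights, the supports cancel and the result is the overall accuracy. A small sketch with a made-up confusion matrix shows the identity:

```python
import numpy as np

# Toy confusion matrix: rows are true classes, columns are predicted classes.
cm = np.array([
    [8, 2, 0],
    [1, 6, 3],
    [0, 4, 6],
])

support = cm.sum(axis=1)                  # true samples per class
recall = np.diag(cm) / support            # per-class recall
precision = np.diag(cm) / cm.sum(axis=0)  # per-class precision

weighted_recall = np.sum(recall * support) / support.sum()
accuracy = np.trace(cm) / cm.sum()
print(weighted_recall, accuracy)
```

For that reason the per-class rows of the report are more informative than the single weighted figure; here they show Western Europe (recall 0.22) as the class the model handles worst.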
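# Save the balanced (post-SMOTE) dataset. X_res and y_res share the same RangeIndex,
# whereas df1 has only 1971 rows with gaps in its index, so assigning y_res to df1 would misalign labels.
df1_balanceado = X_res.copy()
df1_balanceado['Regional indicator'] = y_res
df1_balanceado.to_csv('descarte_balanceado_rf.csv')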
res1 = df_target[['Country name','região estimada']]
res1 = pd.concat([res1,aux_columns[aux_columns.index.isin(df_target.index)]], axis=1)
res1
| | Country name | região estimada | Regional_indicator_consultado_Major | Regional_indicator_consultado |
|---|---|---|---|---|
| 36 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 37 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 38 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 39 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 173 | Belize | Latin America and Caribbean | Latin America and the Caribbean | Central America |
| 174 | Belize | Latin America and Caribbean | Latin America and the Caribbean | Central America |
| 188 | Bhutan | Southeast Asia | Asia | Southern Asia |
| 189 | Bhutan | Southeast Asia | Asia | Southern Asia |
| 190 | Bhutan | Southeast Asia | Asia | Southern Asia |
| 331 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 332 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 333 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 334 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 335 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 401 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 402 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 403 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 404 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 405 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 406 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 407 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 408 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 481 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 482 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 483 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 484 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 1682 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1683 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1684 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1685 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1686 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1718 | Syria | South Asia | Asia | Western Asia |
| 1719 | Syria | Sub-Saharan Africa | Asia | Western Asia |
| 1720 | Syria | Central and Eastern Europe | Asia | Western Asia |
| 1721 | Syria | South Asia | Asia | Western Asia |
| 1722 | Syria | South Asia | Asia | Western Asia |
| 1723 | Syria | South Asia | Asia | Western Asia |
| 1724 | Syria | Sub-Saharan Africa | Asia | Western Asia |
| 1797 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
| 1798 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
| 1799 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
| 1800 | Trinidad and Tobago | Southeast Asia | Latin America and the Caribbean | Caribbean |
| 1801 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
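The model predicts one region per row (country and year), so a country can receive several different regions: in the table above Syria gets three and Trinidad and Tobago two. One possible way to collapse these into a single answer per country is a majority vote, together with the share of rows that agree with it. The sketch below uses a toy frame with the same two columns as `res1`; the helper names `top_region` and `agreement` are hypothetical.

```python
import pandas as pd

# Toy per-row predictions (illustrative; mirrors the layout of res1).
preds = pd.DataFrame({
    "Country name": ["X", "X", "X", "Y", "Y"],
    "região estimada": ["R1", "R2", "R1", "R3", "R3"],
})

def top_region(s):
    """Most frequent predicted region for one country."""
    return s.value_counts().idxmax()

def agreement(s):
    """Share of that country's rows that agree with the most frequent prediction."""
    return s.value_counts(normalize=True).max()

summary = preds.groupby("Country name")["região estimada"].agg([top_region, agreement])
print(summary)
```

A low `agreement` value flags countries whose yearly indicators sit between regions, which is a useful signal that the prediction should be read with caution.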
df_target[ (df_target['Country name'] == 'Syria') & (df_target['região estimada'] == 'Central and Eastern Europe') ]
| | Country name | região estimada | Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1720 | Syria | Central and Eastern Europe | NaN | 2010 | 4.464708 | 8.729084 | 0.934232 | 64.099998 | 0.647048 | 0.007883 | 0.743094 | 0.557652 | 0.224644 |
df1[ (df1['year'] == 2010) & (df1['Regional indicator'] == 'Central and Eastern Europe') ][['Regional indicator', 'year', 'Ladder score','Logged GDP per capita', 'Social support',
'Healthy life expectancy','Freedom to make life choices', 'Generosity', 'Perceptions of corruption',
'Positive affect', 'Negative affect']].groupby(by='Regional indicator').agg('mean')
| Regional indicator | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|
| Central and Eastern Europe | 2010.0 | 5.142485 | 8.820354 | 0.801815 | 59.317647 | 0.688035 | -0.033074 | 0.782977 | 0.669645 | 0.246042 |
joblib.dump(clf, 'rf0.joblib')
df_target.to_csv('rf_resultados_tabela.csv')
res1.to_csv('rf_comp_direta.csv')
fi = pd.DataFrame({'features':X_train.columns,'importances':clf.feature_importances_})
fi.sort_values(by='importances', ascending=False, inplace=True)
fi.index = range(fi.shape[0])
sem_imp = fi[fi.importances == 0]
fi = fi[fi.importances > 0]
fi[fi.importances > 0.01]
| | features | importances |
|---|---|---|
| 0 | Logged GDP per capita | 0.098954 |
| 1 | Ladder score | 0.089725 |
| 2 | Healthy life expectancy | 0.086614 |
| 3 | Positive affect | 0.079099 |
| 4 | Social support | 0.076514 |
| 5 | Perceptions of corruption | 0.057689 |
| 6 | Generosity | 0.047708 |
| 7 | Mongolia | 0.038714 |
| 8 | Australia | 0.037656 |
| 9 | Canada | 0.035235 |
| 10 | Taiwan Province of China | 0.033808 |
| 11 | Freedom to make life choices | 0.031993 |
| 12 | Japan | 0.021933 |
| 13 | Indonesia | 0.021338 |
| 14 | Thailand | 0.020518 |
| 15 | Negative affect | 0.018886 |
| 16 | United States | 0.018211 |
| 17 | Pakistan | 0.016323 |
| 18 | New Zealand | 0.016167 |
| 19 | South Korea | 0.013238 |
| 20 | Moldova | 0.010937 |
| 21 | Nepal | 0.010531 |
print(f"Atributos sem importância: {len(sem_imp.features.values)}\n\n{sem_imp.features.values}")
Atributos sem importância: 105 ['Zambia' 'Morocco' 'Malta' 'Nicaragua' 'Uruguay' 'Mauritania' 'Uzbekistan' 'Venezuela' 'Montenegro' 'Yemen' 'Mauritius' 'Mozambique' 'Niger' 'Mexico' 'Namibia' 'Sri Lanka' 'Nigeria' 'Uganda' 'Sudan' 'South Africa' 'Mali' 'Swaziland' 'Sierra Leone' 'Serbia' 'Senegal' 'Russia' 'Poland' 'Switzerland' 'Syria' 'Tanzania' 'Togo' 'Paraguay' 'Panama' 'Trinidad and Tobago' 'Norway' 'Spain' 'Tunisia' 'North Macedonia' 'Kenya' 'Malawi' 'Burkina Faso' 'Denmark' 'Czech Republic' 'Cyprus' 'Croatia' 'Costa Rica' 'Congo (Kinshasa)' 'Congo (Brazzaville)' 'Comoros' 'Colombia' 'Chile' 'Chad' 'Central African Republic' 'Cameroon' 'Burundi' 'Bulgaria' 'Madagascar' 'Brazil' 'Botswana' 'Bosnia and Herzegovina' 'Bolivia' 'Bhutan' 'Benin' 'Belize' 'Belgium' 'Bahrain' 'Armenia' 'Argentina' 'Angola' 'Algeria' 'Afghanistan' 'Djibouti' 'Dominican Republic' 'Ecuador' 'Egypt' 'Luxembourg' 'Lithuania' 'Libya' 'Liberia' 'Lesotho' 'Lebanon' 'Latvia' 'Laos' 'Kuwait' 'Jordan' 'Jamaica' 'Ivory Coast' 'Italy' 'Ireland' 'Iraq' 'Iceland' 'Hungary' 'Honduras' 'Haiti' 'Guinea' 'Guatemala' 'Greece' 'Ghana' 'Gambia' 'Gabon' 'France' 'Finland' 'Ethiopia' 'Estonia' 'Zimbabwe']
fi[(fi.features.isin(sem_tgt_paises))]
| | features | importances |
|---|---|---|
| 0 | Logged GDP per capita | 0.098954 |
| 1 | Ladder score | 0.089725 |
| 2 | Healthy life expectancy | 0.086614 |
| 3 | Positive affect | 0.079099 |
| 4 | Social support | 0.076514 |
| 5 | Perceptions of corruption | 0.057689 |
| 6 | Generosity | 0.047708 |
| 11 | Freedom to make life choices | 0.031993 |
| 15 | Negative affect | 0.018886 |
| 44 | year | 0.000922 |
features_paises = fi[~fi.features.isin(sem_tgt_paises)]
print(f"Total de \"features de países\": {features_paises.shape[0]}\nTotal de importância: {features_paises.importances.sum()}")
Total de "features de países": 47
Total de importância: 0.41189564926463523
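As a quick arithmetic check on the totals printed above, the sketch below sums the importances of the ten non-country features from the preceding table and adds the total reported for the country columns. All values are copied from the outputs above; the variable names are illustrative and not part of the notebook.

```python
# Importances of the non-country features, copied from the table above.
metric_importances = {
    "Logged GDP per capita": 0.098954,
    "Ladder score": 0.089725,
    "Healthy life expectancy": 0.086614,
    "Positive affect": 0.079099,
    "Social support": 0.076514,
    "Perceptions of corruption": 0.057689,
    "Generosity": 0.047708,
    "Freedom to make life choices": 0.031993,
    "Negative affect": 0.018886,
    "year": 0.000922,
}

# Total importance printed above for the 47 one-hot country columns.
country_total = 0.41189564926463523

metric_total = sum(metric_importances.values())
overall = metric_total + country_total

print(f"metrics: {metric_total:.4f}")
print(f"countries: {country_total:.4f}")
print(f"sum: {overall:.4f}")
```

With these numbers the two groups add up to approximately 1, so the one-hot country columns carry roughly 41% of the forest's total importance. The next cell retrains the model on the non-country features only.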
mf_sem_paises_rf = [*fi[(fi.features.isin(sem_tgt_paises))].features] # best features, excluding the country columns
com_paises_rf = [*fi[fi.importances >= 0.0005].features] # features above the importance threshold, country columns included
X_train4, X_test4, y_train4, y_test4 = train_test_split(X_res[mf_sem_paises_rf], y_res, test_size=0.15, random_state=42)
clf2 = RandomForestClassifier(max_depth=2, random_state=0)
clf2.fit(X_train4, y_train4)
RandomForestClassifier(max_depth=2, random_state=0)
clf2.score(X_test4,y_test4)
0.5774647887323944
print(f"ROC AUC: {round(roc_auc_score(y_test4, clf2.predict_proba(X_test4), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test4, clf2.predict(X_test4), average='weighted'),4)}")
print(classification_report(y_test4, clf2.predict(X_test4)))#, target_names=labels))
ROC AUC: 0.9142
Recall: 0.5775
precision recall f1-score support
Central and Eastern Europe 0.48 0.54 0.51 63
Commonwealth of Independent States 0.56 0.45 0.50 78
East Asia 0.50 0.77 0.60 64
Latin America and Caribbean 0.88 0.70 0.78 60
Middle East and North Africa 0.53 0.29 0.37 59
North America and ANZ 0.60 1.00 0.75 66
South Asia 0.69 0.36 0.47 61
Southeast Asia 0.67 0.56 0.61 57
Sub-Saharan Africa 0.54 0.94 0.69 67
Western Europe 0.41 0.14 0.21 64
accuracy 0.58 639
macro avg 0.59 0.57 0.55 639
weighted avg 0.58 0.58 0.55 639
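To make the difference between the `macro avg` and `weighted avg` rows concrete, the sketch below recomputes both recall averages from the per-class rows of the report above. The inputs are the rounded values shown in the report, so the results only match the report up to rounding; the dictionary name is illustrative.

```python
# (recall, support) per region, copied from the classification report above.
report = {
    "Central and Eastern Europe": (0.54, 63),
    "Commonwealth of Independent States": (0.45, 78),
    "East Asia": (0.77, 64),
    "Latin America and Caribbean": (0.70, 60),
    "Middle East and North Africa": (0.29, 59),
    "North America and ANZ": (1.00, 66),
    "South Asia": (0.36, 61),
    "Southeast Asia": (0.56, 57),
    "Sub-Saharan Africa": (0.94, 67),
    "Western Europe": (0.14, 64),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall * support for recall, support in report.values()) / total_support

print(f"support: {total_support}")
print(f"macro recall: {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

Because the classes were balanced by oversampling, the supports are similar and the two averages stay close. The per-class rows are more informative here: Western Europe and Middle East and North Africa are recovered far less often than North America and ANZ or Sub-Saharan Africa.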
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_res2, y_res2, test_size=0.35, random_state=42)
cb = CatBoostClassifier(
custom_loss=[metrics.AUCMulticlass()],
random_seed=42,
logging_level='Silent',
iterations=150
)
cb.fit(
X_train2, y_train2,
cat_features=[0],
eval_set=(X_test2, y_test2),
# logging_level='Verbose',
plot=True
);
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
cb.score(X_test2,y_test2)
1.0
cb.best_score_
{'learn': {'MultiClass': 0.010164689602621865},
'validation': {'AUC:type=Mu': 1.0, 'MultiClass': 0.0038862307858915017}}
print(f"ROC AUC: {round(roc_auc_score(y_test2, cb.predict_proba(X_test2), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test2, cb.predict(X_test2), average='weighted'),4)}")
print(classification_report(y_test2, cb.predict(X_test2)))#, target_names=labels))
ROC AUC: 1.0
Recall: 1.0
precision recall f1-score support
Central and Eastern Europe 1.00 1.00 1.00 148
Commonwealth of Independent States 1.00 1.00 1.00 158
East Asia 1.00 1.00 1.00 153
Latin America and Caribbean 1.00 1.00 1.00 125
Middle East and North Africa 1.00 1.00 1.00 149
North America and ANZ 1.00 1.00 1.00 150
South Asia 1.00 1.00 1.00 142
Southeast Asia 1.00 1.00 1.00 156
Sub-Saharan Africa 1.00 1.00 1.00 162
Western Europe 1.00 1.00 1.00 148
accuracy 1.00 1491
macro avg 1.00 1.00 1.00 1491
weighted avg 1.00 1.00 1.00 1491
predictions = cb.predict(df_target2.drop(columns=['Regional indicator']))
predictions_probs = cb.predict_proba(df_target2.drop(columns=['Regional indicator']))
df_target2['região estimada'] = predictions
res2 = df_target2[['Country name','região estimada']]
res2 = pd.concat([res2,aux_columns[aux_columns.index.isin(df_target.index)]], axis=1)
res2
| | Country name | região estimada | Regional_indicator_consultado_Major | Regional_indicator_consultado |
|---|---|---|---|---|
| 36 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 37 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 38 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 39 | Angola | Sub-Saharan Africa | Africa | Middle Africa |
| 173 | Belize | Latin America and Caribbean | Latin America and the Caribbean | Central America |
| 174 | Belize | Sub-Saharan Africa | Latin America and the Caribbean | Central America |
| 188 | Bhutan | Sub-Saharan Africa | Asia | Southern Asia |
| 189 | Bhutan | Sub-Saharan Africa | Asia | Southern Asia |
| 190 | Bhutan | Sub-Saharan Africa | Asia | Southern Asia |
| 331 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 332 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 333 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 334 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 335 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 401 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 402 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 403 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 404 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 405 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 406 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 407 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 408 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 481 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 482 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 483 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 484 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 1682 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1683 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1684 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1685 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1686 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1718 | Syria | Middle East and North Africa | Asia | Western Asia |
| 1719 | Syria | Middle East and North Africa | Asia | Western Asia |
| 1720 | Syria | Middle East and North Africa | Asia | Western Asia |
| 1721 | Syria | Middle East and North Africa | Asia | Western Asia |
| 1722 | Syria | Sub-Saharan Africa | Asia | Western Asia |
| 1723 | Syria | Sub-Saharan Africa | Asia | Western Asia |
| 1724 | Syria | Sub-Saharan Africa | Asia | Western Asia |
| 1797 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
| 1798 | Trinidad and Tobago | Sub-Saharan Africa | Latin America and the Caribbean | Caribbean |
| 1799 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
| 1800 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
| 1801 | Trinidad and Tobago | Latin America and Caribbean | Latin America and the Caribbean | Caribbean |
df_target2.to_csv('descarte_balanceado_catboost.csv')
import joblib
joblib.dump(cb, 'cb0.joblib')
res2.to_csv('catboost_comp_direta.csv')
fi2 = pd.DataFrame({'features':X_train2.columns,'importances':cb.feature_importances_})
fi2.sort_values(by='importances', ascending=False, inplace=True)
fi2.index = range(fi2.shape[0])
sem_imp = fi2[fi2.importances == 0]
fi2 = fi2[fi2.importances > 0]
fi2[fi2.importances > 0.01]
| | features | importances |
|---|---|---|
| 0 | Country name | 58.414788 |
| 1 | Logged GDP per capita | 10.303839 |
| 2 | Positive affect | 9.234681 |
| 3 | Healthy life expectancy | 6.624715 |
| 4 | Generosity | 5.484185 |
| 5 | Ladder score | 4.168928 |
| 6 | Negative affect | 4.115216 |
| 7 | Social support | 0.516048 |
| 8 | Perceptions of corruption | 0.480882 |
| 9 | Freedom to make life choices | 0.464831 |
| 10 | year | 0.191887 |
print(f"Atributos sem importância: {len(sem_imp.features.values)}\n{sem_imp.features.values}")
Atributos sem importância: 0 []
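The perfect test score above coincides with `Country name` carrying about 58% of the importance, and a random row split places rows from the same country on both sides of the split. One way to check how much of the score depends on recognising the country is to split by country instead of by row. The sketch below is a minimal pure-Python illustration of that idea, not the notebook's method; `group_train_test_split` is a hypothetical helper, and the country list is toy data. scikit-learn's `GroupShuffleSplit` offers the same kind of split.

```python
import random


def group_train_test_split(groups, test_size=0.35, seed=42):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # Hold out a fraction of the groups (at least one), not a fraction of the rows.
    n_test = max(1, round(len(unique_groups) * test_size))
    test_groups = set(unique_groups[:n_test])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data: one entry per row, several rows (years) per country.
countries = ["Angola", "Angola", "Belize", "Belize", "Bhutan", "Sudan", "Sudan", "Syria"]

train_idx, test_idx = group_train_test_split(countries)
train_countries = {countries[i] for i in train_idx}
test_countries = {countries[i] for i in test_idx}

print("train:", sorted(train_countries))
print("test:", sorted(test_countries))
```

With such a split, every test row comes from a country the model has never seen, so the score reflects what the metrics alone reveal about the region.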
fi2[(fi2.features.isin(sem_tgt_paises))]
| | features | importances |
|---|---|---|
| 1 | Logged GDP per capita | 10.303839 |
| 2 | Positive affect | 9.234681 |
| 3 | Healthy life expectancy | 6.624715 |
| 4 | Generosity | 5.484185 |
| 5 | Ladder score | 4.168928 |
| 6 | Negative affect | 4.115216 |
| 7 | Social support | 0.516048 |
| 8 | Perceptions of corruption | 0.480882 |
| 9 | Freedom to make life choices | 0.464831 |
| 10 | year | 0.191887 |
mf_sem_paises_rf = [*fi2[(fi2.features.isin(sem_tgt_paises))].features] # best features, excluding the country column
com_paises_rf = [*fi2[fi2.importances >= 0.0005].features] # features above the importance threshold, country column included
X_train3, X_test3, y_train3, y_test3 = train_test_split(X_res2[mf_sem_paises_rf], y_res2, test_size=0.35, random_state=42)
cb1 = CatBoostClassifier(
custom_loss=[metrics.AUCMulticlass()],
random_seed=42,
logging_level='Silent',
iterations=10
)
cb1.fit(
X_train3, y_train3,
cat_features=[],
eval_set=(X_test3, y_test3),
# logging_level='Verbose',
plot=True
);
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
cb1.score(X_test3,y_test3)
0.846411804158283
cb1.best_score_
{'learn': {'MultiClass': 0.4165059699330663},
'validation': {'AUC:type=Mu': 0.9960896617876441,
'MultiClass': 0.4972461218352189}}
print(f"ROC AUC: {round(roc_auc_score(y_test3, cb1.predict_proba(X_test3), multi_class='ovr'),4)}\nRecall: {round(recall_score(y_test3, cb1.predict(X_test3), average='weighted'),4)}")
print(classification_report(y_test3, cb1.predict(X_test3)))#, target_names=labels))
ROC AUC: 0.9897
Recall: 0.8464
precision recall f1-score support
Central and Eastern Europe 0.79 0.78 0.79 148
Commonwealth of Independent States 0.85 0.81 0.83 158
East Asia 0.87 0.98 0.92 153
Latin America and Caribbean 0.84 0.89 0.86 125
Middle East and North Africa 0.75 0.74 0.75 149
North America and ANZ 0.91 0.96 0.93 150
South Asia 0.82 0.87 0.84 142
Southeast Asia 0.91 0.83 0.87 156
Sub-Saharan Africa 0.89 0.81 0.85 162
Western Europe 0.83 0.80 0.81 148
accuracy 0.85 1491
macro avg 0.85 0.85 0.85 1491
weighted avg 0.85 0.85 0.85 1491
predictions = cb1.predict(df_target2.drop(columns=['Regional indicator','Country name','região estimada']))
predictions_probs = cb1.predict_proba(df_target2.drop(columns=['Regional indicator','Country name','região estimada']))
df_target2['região estimada'] = predictions
res3 = df_target2[['Country name','região estimada']]
res3 = pd.concat([res3,aux_columns[aux_columns.index.isin(df_target.index)]], axis=1)
res3
| | Country name | região estimada | Regional_indicator_consultado_Major | Regional_indicator_consultado |
|---|---|---|---|---|
| 36 | Angola | South Asia | Africa | Middle Africa |
| 37 | Angola | Middle East and North Africa | Africa | Middle Africa |
| 38 | Angola | Middle East and North Africa | Africa | Middle Africa |
| 39 | Angola | Middle East and North Africa | Africa | Middle Africa |
| 173 | Belize | Latin America and Caribbean | Latin America and the Caribbean | Central America |
| 174 | Belize | Latin America and Caribbean | Latin America and the Caribbean | Central America |
| 188 | Bhutan | Southeast Asia | Asia | Southern Asia |
| 189 | Bhutan | Southeast Asia | Asia | Southern Asia |
| 190 | Bhutan | Southeast Asia | Asia | Southern Asia |
| 331 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 332 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 333 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 334 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 335 | Central African Republic | Sub-Saharan Africa | Africa | Middle Africa |
| 401 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 402 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 403 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 404 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 405 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 406 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 407 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 408 | Congo (Kinshasa) | Sub-Saharan Africa | Africa | Middle Africa |
| 481 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 482 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 483 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 484 | Djibouti | Sub-Saharan Africa | Africa | Eastern Africa |
| 1682 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1683 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1684 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1685 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1686 | Sudan | Sub-Saharan Africa | Africa | Northern Africa |
| 1718 | Syria | South Asia | Asia | Western Asia |
| 1719 | Syria | South Asia | Asia | Western Asia |
| 1720 | Syria | Middle East and North Africa | Asia | Western Asia |
| 1721 | Syria | South Asia | Asia | Western Asia |
| 1722 | Syria | South Asia | Asia | Western Asia |
| 1723 | Syria | South Asia | Asia | Western Asia |
| 1724 | Syria | South Asia | Asia | Western Asia |
| 1797 | Trinidad and Tobago | Southeast Asia | Latin America and the Caribbean | Caribbean |
| 1798 | Trinidad and Tobago | Southeast Asia | Latin America and the Caribbean | Caribbean |
| 1799 | Trinidad and Tobago | Southeast Asia | Latin America and the Caribbean | Caribbean |
| 1800 | Trinidad and Tobago | Southeast Asia | Latin America and the Caribbean | Caribbean |
| 1801 | Trinidad and Tobago | Commonwealth of Independent States | Latin America and the Caribbean | Caribbean |
df_target2.to_csv('descarte_balanceado_catboost_sem_paises.csv')
joblib.dump(cb1, 'cb1_sem_paises.joblib')
res3.to_csv('catboost_sem_paises_comp_direta.csv')
df1 = pd.read_csv('df_preenchido_descarte.csv', index_col=0) # dataset with the discarded rows removed
df1.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)
aux_columns = df1[['Regional_indicator_consultado_Major','Regional_indicator_consultado']]
df1.drop(columns=['Country name','Regional indicator'], inplace=True) # drop the auxiliary columns
df1
| | Regional_indicator_consultado_Major | Regional_indicator_consultado | year | Ladder score | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Asia | Southern Asia | 2008 | 3.723590 | 7.370100 | 0.450662 | 50.799999 | 0.718114 | 0.167640 | 0.881686 | 0.517637 | 0.258195 |
| 1 | Asia | Southern Asia | 2009 | 4.401778 | 7.539972 | 0.552308 | 51.200001 | 0.678896 | 0.190099 | 0.850035 | 0.583926 | 0.237092 |
| 2 | Asia | Southern Asia | 2010 | 4.758381 | 7.646709 | 0.539075 | 51.599998 | 0.600127 | 0.120590 | 0.706766 | 0.618265 | 0.275324 |
| 3 | Asia | Southern Asia | 2011 | 3.831719 | 7.619532 | 0.521104 | 51.919998 | 0.495901 | 0.162427 | 0.731109 | 0.611387 | 0.267175 |
| 4 | Asia | Southern Asia | 2012 | 3.782938 | 7.705479 | 0.520637 | 52.240002 | 0.530935 | 0.236032 | 0.775620 | 0.710385 | 0.267919 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2009 | Africa | Eastern Africa | 2017 | 3.638300 | 8.015738 | 0.754147 | 55.000000 | 0.752826 | -0.097645 | 0.751208 | 0.806428 | 0.224051 |
| 2010 | Africa | Eastern Africa | 2018 | 3.616480 | 8.048798 | 0.775388 | 55.599998 | 0.762675 | -0.068427 | 0.844209 | 0.710119 | 0.211726 |
| 2011 | Africa | Eastern Africa | 2019 | 2.693523 | 7.950132 | 0.759162 | 56.200001 | 0.631908 | -0.063791 | 0.830652 | 0.716004 | 0.235354 |
| 2012 | Africa | Eastern Africa | 2020 | 3.159802 | 7.828757 | 0.717243 | 56.799999 | 0.643303 | -0.008696 | 0.788523 | 0.702573 | 0.345736 |
| 2013 | Africa | Eastern Africa | 2021 | 3.144800 | 7.942595 | 0.750470 | 56.200840 | 0.676700 | -0.047346 | 0.820999 | 0.717712 | 0.224420 |
2014 rows × 12 columns
df1[df1['Ladder score'].isna()].shape
(0, 12)
df1.Regional_indicator_consultado.value_counts()
Western Asia 225
Southern Europe 172
Western Africa 172
Eastern Africa 161
South America 157
Eastern Europe 146
Northern Europe 143
South-Eastern Asia 125
Central America 109
Southern Asia 106
Western Europe 99
Middle Africa 69
Central Asia 62
Northern Africa 61
Eastern Asia 60
Southern Africa 45
Caribbean 40
Northern America 32
Australia and New Zealand 30
Name: Regional_indicator_consultado, dtype: int64
cat = np.where( (df1.drop(columns='Ladder score').dtypes != float) & (df1.drop(columns='Ladder score').dtypes != 'int64') )[0]
smnc = SMOTENC(random_state=42, n_jobs=-1, categorical_features=cat)
X_res, y_res = smnc.fit_resample(df1.drop(columns=['Regional_indicator_consultado']), df1['Regional_indicator_consultado'])
df3 = pd.get_dummies(df1, columns=['Regional_indicator_consultado_Major', 'Regional_indicator_consultado'], prefix='', prefix_sep='', sparse=False, dtype=bool)
print(f"amostras antes: {df1.shape[0]}, amostras depois: {df3.shape[0]}")
amostras antes: 2014, amostras depois: 2014
X_train, X_test, y_train, y_test = train_test_split(df3.drop(columns=['Ladder score']), df3['Ladder score'], random_state=42, test_size=0.15)
reg = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
reg.score(X_test, y_test)
0.920583271976963
y_pred_reg = reg.predict(X_test)
print(f"MAE: {round(mean_absolute_error(y_test, y_pred_reg),4)}\nR2: {round(r2_score(y_test, y_pred_reg),4)}\nExp. Variance: {round(explained_variance_score(y_test, y_pred_reg),4)}\
\nMax. Error: {round(max_error(y_test, y_pred_reg),4)}\nMSE: {round(mean_squared_error(y_test, y_pred_reg),4)}")
#print(classification_report(y_test, cb1.predict(X_test)))#, target_names=labels))
MAE: 0.2384
R2: 0.9206
Exp. Variance: 0.9212
Max. Error: 0.9347
MSE: 0.0991
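For reference, the metrics printed above can be reproduced by hand on a small example. The sketch below computes MAE, MSE, maximum error and R2 in plain Python on four made-up values; the data is purely illustrative and unrelated to the notebook's test set.

```python
# Toy values, chosen only to keep the arithmetic easy to follow.
y_true = [3.0, 5.0, 4.0, 6.0]
y_pred = [2.5, 5.0, 4.5, 7.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n        # mean absolute error
mse = sum(e ** 2 for e in errors) / n        # mean squared error
max_err = max(abs(e) for e in errors)        # largest single miss

# R2 compares the residual sum of squares with the spread around the mean.
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"Max. Error: {max_err:.4f}")
print(f"R2: {r2:.4f}")
```

Read this way, the MAE of 0.2384 reported above means the estimated Ladder score misses the real one by about 0.24 points on average, while the maximum error shows the single worst estimate in the test set.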
reg_res = pd.DataFrame({'real':[*y_test], 'estimado':[*y_pred_reg]})
fig = go.Figure()
for column in reg_res:
fig.add_trace(go.Scatter( x=reg_res.index, y=reg_res[column], name = column, mode = 'lines') )
fig.add_trace(go.Scatter( x=reg_res.index, y=(reg_res.real-reg_res.estimado), name='Diff', mode='lines') )
fig.update_layout(title = "Erro de predição", xaxis_title = 'Amostra')
joblib.dump(reg,'extra_trees_ladder_score.joblib')
['extra_trees_ladder_score.joblib']
reg_res.index = range(reg_res.shape[0])
X_test.index = range(X_test.shape[0])
reg_res = pd.concat([reg_res,X_test], axis=1)
reg_res.to_csv('comparcao_direta_ls_extra_trees.csv')
cat = np.where( (df1.drop(columns='Ladder score').dtypes != float) & (df1.drop(columns='Ladder score').dtypes != 'int64') )[0]
smnc = SMOTENC(random_state=42, n_jobs=-1, categorical_features=cat)
X_res, y_res = smnc.fit_resample(df1.drop(columns=['Regional_indicator_consultado']), df1['Regional_indicator_consultado'])
print(f"amostras antes: {df1.shape[0]}, amostras depois: {X_res.shape[0]}")
amostras antes: 2014, amostras depois: 4275
X_res = pd.concat([X_res,y_res], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_res.drop(columns=['Ladder score']), X_res['Ladder score'], random_state=42, test_size=0.15)
cb = CatBoostRegressor(
loss_function='RMSE',
random_seed=42,
logging_level='Silent',
#iterations=150
)
cat = np.where( (X_res.drop(columns='Ladder score').dtypes != float) & (X_res.drop(columns='Ladder score').dtypes != 'int64') )[0]
cb.fit(
X_train, y_train,
cat_features=cat,
eval_set=(X_test, y_test),
# logging_level='Verbose', # you can uncomment this for text output
plot=True
);
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
cb.score(X_test,y_test)
0.9511750848469631
cb.best_score_
{'learn': {'RMSE': 0.12157044081061888},
'validation': {'RMSE': 0.2483869872117922}}
y_pred_reg = cb.predict(X_test)
print(f"MAE: {round(mean_absolute_error(y_test, y_pred_reg),4)}\nR2: {round(r2_score(y_test, y_pred_reg),4)}\nExp. Variance: {round(explained_variance_score(y_test, y_pred_reg),4)}\
\nMax. Error: {round(max_error(y_test, y_pred_reg),4)}\nMSE: {round(mean_squared_error(y_test, y_pred_reg),4)}")
#print(classification_report(y_test, cb1.predict(X_test)))#, target_names=labels))
MAE: 0.1702
R2: 0.9512
Exp. Variance: 0.9512
Max. Error: 1.3694
MSE: 0.0617
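CatBoost reports RMSE for the validation set, while the metrics cell reports MSE. Since both are computed on the same test split, squaring the RMSE should recover the MSE. The short check below uses only the two values printed above; the variable names are illustrative.

```python
validation_rmse = 0.2483869872117922  # from cb.best_score_ above
reported_mse = 0.0617                 # from the metrics printout above

# MSE is the square of RMSE.
mse_from_rmse = validation_rmse ** 2

print(f"RMSE squared: {mse_from_rmse:.4f}")
print(f"Reported MSE: {reported_mse:.4f}")
```

The two values agree to four decimal places, which suggests that `cb.best_score_` and the scikit-learn metrics describe the same predictions.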
reg_res = pd.DataFrame({'real':[*y_test], 'estimado':[*y_pred_reg]})
fig = go.Figure()
for column in reg_res:
fig.add_trace(go.Scatter( x=reg_res.index, y=reg_res[column], name = column, mode = 'lines') )
fig.add_trace(go.Scatter( x=reg_res.index, y=(reg_res.real-reg_res.estimado), name='erro', mode='lines') )
fig.update_layout(title = "Erro de predição", xaxis_title = 'Amostra')
joblib.dump(cb,'catboost_ladder_score.joblib')
['catboost_ladder_score.joblib']
reg_res.index = range(reg_res.shape[0])
X_test.index = range(X_test.shape[0])
reg_res = pd.concat([reg_res,X_test], axis=1)
reg_res.to_csv('comparcao_direta_ls_catboost.csv')